| 2025-07-21 |
Multi-Stage Prompt Inference Attacks on Enterprise LLM Systems |
Andrii Balashov et.al. |
2507.15613 |
null |
| 2025-07-21 |
QSAF: A Novel Mitigation Framework for Cognitive Degradation in Agentic AI |
Hammad Atta et.al. |
2507.15330 |
null |
| 2025-07-21 |
PromptArmor: Simple yet Effective Prompt Injection Defenses |
Tianneng Shi et.al. |
2507.15219 |
null |
| 2025-07-20 |
DeRAG: Black-box Adversarial Attacks on Multiple Retrieval-Augmented Generation Applications via Prompt Injection |
Jerry Wang et.al. |
2507.15042 |
null |
| 2025-07-20 |
AlphaAlign: Incentivizing Safety Alignment with Extremely Simplified Reinforcement Learning |
Yi Zhang et.al. |
2507.14987 |
null |
| 2025-07-20 |
Hierarchical Cross-modal Prompt Learning for Vision-Language Models |
Hao Zheng et.al. |
2507.14976 |
null |
| 2025-07-20 |
Strategic Integration of AI Chatbots in Physics Teacher Preparation: A TPACK-SWOT Analysis of Pedagogical, Epistemic, and Cybersecurity Dimensions |
N. Mohammadipour et.al. |
2507.14860 |
null |
| 2025-07-20 |
Manipulating LLM Web Agents with Indirect Prompt Injection Attack via HTML Accessibility Tree |
Sam Johnson et.al. |
2507.14799 |
null |
| 2025-07-18 |
Innocence in the Crossfire: Roles of Skip Connections in Jailbreaking Visual Language Models |
Palash Nandi et.al. |
2507.13761 |
null |
| 2025-07-18 |
TopicAttack: An Indirect Prompt Injection Attack via Topic Transition |
Yulin Chen et.al. |
2507.13686 |
null |
| 2025-07-17 |
Paper Summary Attack: Jailbreaking LLMs through LLM Safety Papers |
Liang Lin et.al. |
2507.13474 |
null |
| 2025-07-17 |
Prompt Injection 2.0: Hybrid AI Threats |
Jeremy McHugh et.al. |
2507.13169 |
null |
| 2025-07-17 |
MAD-Spear: A Conformity-Driven Prompt Injection Attack on Multi-Agent Debate Systems |
Yu Cui et.al. |
2507.13038 |
null |
| 2025-07-16 |
Exploiting Jailbreaking Vulnerabilities in Generative AI to Bypass Ethical Safeguards for Facilitating Phishing Attacks |
Rina Mishra et.al. |
2507.12185 |
null |
| 2025-07-16 |
LLMs Encode Harmfulness and Refusal Separately |
Jiachen Zhao et.al. |
2507.11878 |
null |
| 2025-07-15 |
Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility |
Brendan Murphy et.al. |
2507.11630 |
null |
| 2025-07-14 |
ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning |
Zhengyue Zhao et.al. |
2507.11500 |
null |
| 2025-07-15 |
The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs |
Zichen Wen et.al. |
2507.11097 |
null |
| 2025-07-17 |
SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems |
Wenliang Shan et.al. |
2507.08898 |
null |
| 2025-07-10 |
A Dynamic Stackelberg Game Framework for Agentic AI Defense Against LLM Jailbreaking |
Zhengye Han et.al. |
2507.08207 |
null |
| 2025-07-10 |
Defending Against Prompt Injection With a Few DefensiveTokens |
Sizhe Chen et.al. |
2507.07974 |
null |
| 2025-07-10 |
GuardVal: Dynamic Large Language Model Jailbreak Evaluation for Comprehensive Safety Testing |
Peiyan Zhang et.al. |
2507.07735 |
null |
| 2025-07-10 |
May I have your Attention? Breaking Fine-Tuning based Prompt Injection Defenses using Architecture-Aware Attacks |
Nishit V. Pandya et.al. |
2507.07417 |
null |
| 2025-07-09 |
An attention-aware GNN-based input defender against multi-turn jailbreak on LLMs |
Zixuan Huang et.al. |
2507.07146 |
null |
| 2025-07-11 |
The Dark Side of LLMs Agent-based Attacks for Complete Computer Takeover |
Matteo Lupinacci et.al. |
2507.06850 |
null |
| 2025-07-09 |
On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks |
Stephen Obadinma et.al. |
2507.06489 |
null |
| 2025-07-09 |
Foundation Model Self-Play: Open-Ended Strategy Innovation via Foundation Models |
Aaron Dharna et.al. |
2507.06466 |
null |
| 2025-07-08 |
Bridging AI and Software Security: A Comparative Vulnerability Assessment of LLM Agent Deployment Paradigms |
Tarek Gasmi et.al. |
2507.06323 |
null |
| 2025-07-08 |
The bitter lesson of misuse detection |
Hadrien Mariaccia et.al. |
2507.06282 |
null |
| 2025-07-08 |
Hidden Prompts in Manuscripts Exploit AI-Assisted Peer Review |
Zhicheng Lin et.al. |
2507.06185 |
null |
| 2025-07-08 |
CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations |
Xiaohu Li et.al. |
2507.06043 |
null |
| 2025-07-08 |
TuneShield: Mitigating Toxicity in Conversational AI while Fine-tuning on Untrusted Data |
Aravind Cheruvu et.al. |
2507.05660 |
null |
| 2025-07-08 |
How Not to Detect Prompt Injections with an LLM |
Sarthak Choudhary et.al. |
2507.05630 |
null |
| 2025-07-07 |
A Systematization of Security Vulnerabilities in Computer Use Agents |
Daniel Jones et.al. |
2507.05445 |
null |
| 2025-07-07 |
Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models |
Ziqi Miao et.al. |
2507.05248 |
null |
| 2025-07-07 |
Trojan Horse Prompting: Jailbreaking Conversational Multimodal Models by Forging Assistant Message |
Wei Duan et.al. |
2507.04673 |
null |
| 2025-07-06 |
Tail-aware Adversarial Attacks: A Distributional Approach to Efficient LLM Jailbreaking |
Tim Beyer et.al. |
2507.04446 |
null |
| 2025-07-06 |
Attention Slipping: A Mechanistic Understanding of Jailbreak Attacks and Defenses in LLMs |
Xiaomeng Hu et.al. |
2507.04365 |
null |
| 2025-07-04 |
On Jailbreaking Quantized Language Models Through Fault Injection Attacks |
Noureldin Zahran et.al. |
2507.03236 |
null |
| 2025-07-03 |
Adversarial Manipulation of Reasoning Models using Internal Representations |
Kureha Yamaguchi et.al. |
2507.03167 |
null |
| 2025-07-03 |
LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users |
Almog Hilel et.al. |
2507.02850 |
null |
| 2025-07-03 |
Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection |
Ziqi Miao et.al. |
2507.02844 |
null |
| 2025-07-03 |
Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models |
Riccardo Cantini et.al. |
2507.02799 |
null |
| 2025-07-03 |
Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks |
Sizhe Chen et.al. |
2507.02735 |
null |
| 2025-07-03 |
PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage |
Krishna Kanth Nakka et.al. |
2507.02332 |
null |
| 2025-07-02 |
MGC: A Compiler Framework Exploiting Compositional Blindness in Aligned LLMs for Malware Generation |
Lu Yan et.al. |
2507.02057 |
null |
| 2025-07-02 |
SafePTR: Token-Level Jailbreak Defense in Multimodal LLMs via Prune-then-Restore Mechanism |
Beitao Chen et.al. |
2507.01513 |
null |
| 2025-07-01 |
Reasoning as an Adaptive Defense for Safety |
Taeyoun Kim et.al. |
2507.00971 |
null |
| 2025-07-01 |
SafeMobile: Chain-level Jailbreak Detection and Automated Evaluation for Multimodal Mobile Agents |
Siyuan Liang et.al. |
2507.00841 |
null |
| 2025-07-02 |
Transferable Modeling Strategies for Low-Resource LLM Tasks: A Prompt and Alignment-Based Approach |
Shuangquan Lyu et.al. |
2507.00601 |
null |
| 2025-06-30 |
Linearly Decoding Refused Knowledge in Aligned Language Models |
Aryan Shrivastava et.al. |
2507.00239 |
null |
| 2025-06-30 |
Logit-Gap Steering: Efficient Short-Suffix Jailbreaks for Aligned Large Language Models |
Tung-Ling Li et.al. |
2506.24056 |
null |
| 2025-06-30 |
Leveraging the Potential of Prompt Engineering for Hate Speech Detection in Low-Resource Languages |
Ruhina Tabasshum Prome et.al. |
2506.23930 |
null |
| 2025-06-30 |
Evaluating Multi-Agent Defences Against Jailbreaking Attacks on Large Language Models |
Maria Carolina Cornelia Wit et.al. |
2506.23576 |
null |
| 2025-06-29 |
From Prompt Injections to Protocol Exploits: Threats in LLM-Powered AI Agents Workflows |
Mohamed Amine Ferrag et.al. |
2506.23260 |
null |
| 2025-06-28 |
Agent-to-Agent Theory of Mind: Testing Interlocutor Awareness among Large Language Models |
Younwoo Choi et.al. |
2506.22957 |
null |
| 2025-06-27 |
VERA: Variational Inference Framework for Jailbreaking Large Language Models |
Anamika Lochab et.al. |
2506.22666 |
null |
| 2025-06-27 |
MetaCipher: A General and Extensible Reinforcement Learning Framework for Obfuscation-Based Jailbreak Attacks on Black-Box LLMs |
Boyuan Chen et.al. |
2506.22557 |
null |
| 2025-07-01 |
Red Teaming for Generative AI, Report on a Copyright-Focused Exercise Completed in an Academic Medical Center |
James Wen et.al. |
2506.22523 |
null |
| 2025-06-27 |
A Different Approach to AI Safety: Proceedings from the Columbia Convening on Openness in Artificial Intelligence and AI Safety |
Camille François et.al. |
2506.22183 |
null |
| 2025-06-27 |
Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses |
Mohamed Ahmed et.al. |
2506.21972 |
null |
| 2025-06-24 |
PrivacyXray: Detecting Privacy Breaches in LLMs through Semantic Consistency and Probability Certainty |
Jinwen He et.al. |
2506.19563 |
null |
| 2025-06-24 |
MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models |
Yinan Xia et.al. |
2506.19257 |
null |
| 2025-06-23 |
Command-V: Pasting LLM Behaviors via Activation Profiles |
Barry Wang et.al. |
2506.19140 |
null |
| 2025-06-23 |
Enhancing Security in LLM Applications: A Performance Evaluation of Early Detection Systems |
Valerii Gakh et.al. |
2506.19109 |
null |
| 2025-06-23 |
Security Assessment of DeepSeek and GPT Series Models against Jailbreak Attacks |
Xiaodong Wu et.al. |
2506.18543 |
null |
| 2025-06-23 |
NSFW-Classifier Guided Prompt Sanitization for Safe Text-to-Image Generation |
Yu Xie et.al. |
2506.18325 |
null |
| 2025-06-22 |
Multi-turn Jailbreaking via Global Refinement and Active Fabrication |
Hua Tang et.al. |
2506.17881 |
null |
| 2025-06-20 |
Semantic-Aware Parsing for Security Logs |
Julien Piet et.al. |
2506.17512 |
null |
| 2025-06-20 |
From Concepts to Components: Concept-Agnostic Attention Module Discovery in Transformers |
Jingtong Su et.al. |
2506.17052 |
null |
| 2025-06-20 |
MIST: Jailbreaking Black-box Large Language Models via Iterative Semantic Tuning |
Muyang Zheng et.al. |
2506.16792 |
null |
| 2025-06-20 |
Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models |
Lei Jiang et.al. |
2506.16760 |
null |
| 2025-06-19 |
Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models |
Biao Yi et.al. |
2506.16447 |
null |
| 2025-06-19 |
Probing the Robustness of Large Language Models Safety to Latent Perturbations |
Tianle Gu et.al. |
2506.16078 |
link |
| 2025-06-18 |
Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts |
Kartik Sharma et.al. |
2506.15751 |
null |
| 2025-06-18 |
Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers |
Tommaso Green et.al. |
2506.15674 |
link |
| 2025-06-18 |
From LLMs to MLLMs to Agents: A Survey of Emerging Paradigms in Jailbreak Attacks and Defenses within LLM Ecosystem |
Yanxu Mao et.al. |
2506.15170 |
null |
| 2025-06-17 |
OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents |
Thomas Kuntz et.al. |
2506.14866 |
link |
| 2025-06-17 |
AIRTBench: Measuring Autonomous AI Red Teaming Capabilities in Language Models |
Ads Dawson et.al. |
2506.14682 |
link |
| 2025-06-16 |
Alignment Quality Index (AQI) : Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry, Cluster Divergence, and Layer wise Pooled Representations |
Abhilekh Borah et.al. |
2506.13901 |
null |
| 2025-06-17 |
Safe-Child-LLM: A Developmental Benchmark for Evaluating LLM Safety in Child-LLM Interactions |
Junfeng Jiao et.al. |
2506.13510 |
link |
| 2025-06-15 |
Jailbreak Strength and Model Similarity Predict Transferability |
Rico Angell et.al. |
2506.12913 |
null |
| 2025-06-15 |
Universal Jailbreak Suffixes Are Strong Attention Hijackers |
Matan Ben-Tov et.al. |
2506.12880 |
link |
| 2025-06-15 |
SecurityLingua: Efficient Defense of LLM Jailbreak Attacks via Security-Aware Prompt Compression |
Yucheng Li et.al. |
2506.12707 |
null |
| 2025-06-15 |
Alphabet Index Mapping: Jailbreaking LLMs through Semantic Dissimilarity |
Bilal Saleh Husain et.al. |
2506.12685 |
null |
| 2025-06-14 |
Pushing the Limits of Safety: A Technical Report on the ATLAS Challenge 2025 |
Zonghao Ying et.al. |
2506.12430 |
link |
| 2025-06-14 |
Exploring the Secondary Risks of Large Language Models |
Jiawei Chen et.al. |
2506.12382 |
null |
| 2025-06-14 |
QGuard:Question-based Zero-shot Guard for Multi-modal LLM Safety |
Taegyeong Lee et.al. |
2506.12299 |
null |
| 2025-06-13 |
InfoFlood: Jailbreaking Large Language Models with Information Overload |
Advait Yadav et.al. |
2506.12274 |
null |
| 2025-06-13 |
Investigating Vulnerabilities and Defenses Against Audio-Visual Attacks: A Comprehensive Survey Emphasizing Multimodal Models |
Jinming Wen et.al. |
2506.11521 |
null |
| 2025-06-12 |
How Well Can Reasoning Models Identify and Recover from Unhelpful Thoughts? |
Sohee Yang et.al. |
2506.10979 |
null |
| 2025-06-12 |
SoK: Evaluating Jailbreak Guardrails for Large Language Models |
Xunguang Wang et.al. |
2506.10597 |
link |
| 2025-06-10 |
Evaluation empirique de la sécurisation et de l'alignement de ChatGPT et Gemini: analyse comparative des vulnérabilités par expérimentations de jailbreaks |
Rafaël Nouailles et.al. |
2506.10029 |
null |
| 2025-06-09 |
LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges |
Haoyang Li et.al. |
2506.10022 |
link |
| 2025-06-11 |
LLMail-Inject: A Dataset from a Realistic Adaptive Prompt Injection Challenge |
Sahar Abdelnabi et.al. |
2506.09956 |
link |
| 2025-06-11 |
Effective Red-Teaming of Policy-Adherent Agents |
Itay Nakash et.al. |
2506.09600 |
null |
| 2025-06-11 |
AdversariaL attacK sAfety aLIgnment(ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement- Introducing Adversarial Vulnerability Quality Index (AVQI) |
Danush Khanna et.al. |
2506.08885 |
null |
| 2025-06-11 |
Design Patterns for Securing LLM Agents against Prompt Injections |
Luca Beurer-Kellner et.al. |
2506.08837 |
null |
| 2025-06-09 |
TokenBreak: Bypassing Text Classification Models Through Token Manipulation |
Kasimir Schulz et.al. |
2506.07948 |
null |
| 2025-06-11 |
RSafe: Incentivizing proactive reasoning to build robust and adaptive LLM safeguards |
Jingnan Zheng et.al. |
2506.07736 |
null |
| 2025-06-09 |
Evaluating LLMs Robustness in Less Resourced Languages with Proxy Models |
Maciej Chrabąszcz et.al. |
2506.07645 |
null |
| 2025-06-09 |
TwinBreak: Jailbreaking LLM Security Alignments based on Twin Prompts |
Torsten Krauß et.al. |
2506.07596 |
null |
| 2025-06-09 |
When Style Breaks Safety: Defending Language Models Against Superficial Style Alignment |
Yuxin Xiao et.al. |
2506.07452 |
link |
| 2025-06-09 |
Beyond Jailbreaks: Revealing Stealthier and Broader LLM Security Risks Stemming from Alignment Failures |
Yukai Zhou et.al. |
2506.07402 |
null |
| 2025-06-08 |
AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint |
Leheng Sheng et.al. |
2506.07022 |
link |
| 2025-06-10 |
Auditing Black-Box LLM APIs with a Rank-Based Uniformity Test |
Xiaoyuan Zhu et.al. |
2506.06975 |
null |
| 2025-06-06 |
Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance |
Ruizhong Qiu et.al. |
2506.06444 |
link |
| 2025-06-06 |
Small Models, Big Support: A Local LLM Framework for Teacher-Centric Content Creation and Assessment using RAG and CAG |
Zarreen Reza et.al. |
2506.05925 |
null |
| 2025-06-06 |
To Protect the LLM Agent Against the Prompt Injection Attack with Polymorphic Prompt |
Zhilong Wang et.al. |
2506.05739 |
null |
| 2025-06-05 |
Sentinel: SOTA model to protect against prompt injections |
Dror Ivry et.al. |
2506.05446 |
null |
| 2025-06-05 |
Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets |
Lei Hsiung et.al. |
2506.05346 |
null |
| 2025-06-05 |
HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model |
Youngwan Lee et.al. |
2506.04704 |
null |
| 2025-06-06 |
TracLLM: A Generic Framework for Attributing Long Context LLMs |
Yanting Wang et.al. |
2506.04202 |
link |
| 2025-06-03 |
Adversarial Attacks on Robotic Vision Language Action Models |
Eliot Krzysztof Jones et.al. |
2506.03350 |
link |
| 2025-06-03 |
It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics |
Matthew Kowal et.al. |
2506.02873 |
null |
| 2025-06-03 |
ATAG: AI-Agent Application Threat Assessment with Attack Graphs |
Parth Atulbhai Gandhi et.al. |
2506.02859 |
null |
| 2025-06-03 |
From Prompts to Protection: Large Language Model-Enabled In-Context Learning for Smart Public Safety UAV |
Yousef Emami et.al. |
2506.02649 |
null |
| 2025-06-03 |
BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage |
Kalyan Nakka et.al. |
2506.02479 |
link |
| 2025-06-03 |
VPI-Bench: Visual Prompt Injection Attacks for Computer-Use Agents |
Tri Cao et.al. |
2506.02456 |
link |
| 2025-06-02 |
ReGA: Representation-Guided Abstraction for Model-based Safeguarding of LLMs |
Zeming Wei et.al. |
2506.01770 |
link |
| 2025-06-02 |
Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models |
Youze Wang et.al. |
2506.01307 |
null |
| 2025-06-01 |
Simple Prompt Injection Attacks Can Leak Personal Data Observed by LLM Agents During Task Execution |
Meysam Alizadeh et.al. |
2506.01055 |
null |
| 2025-06-01 |
Predicting Empirical AI Research Outcomes with Language Models |
Jiaxin Wen et.al. |
2506.00794 |
null |
| 2025-06-01 |
Jailbreak-R1: Exploring the Jailbreak Capabilities of LLMs via Reinforcement Learning |
Weiyang Guo et.al. |
2506.00782 |
null |
| 2025-05-30 |
TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis |
Xiaorui Wu et.al. |
2505.24672 |
link |
| 2025-05-30 |
Benchmarking Large Language Models for Cryptanalysis and Mismatched-Generalization |
Utsav Maskey et.al. |
2505.24621 |
null |
| 2025-05-30 |
AMIA: Automatic Masking and Joint Intention Analysis Makes LVLMs Robust Jailbreak Defenders |
Yuqi Zhang et.al. |
2505.24519 |
null |
| 2025-05-30 |
Model Unlearning via Sparse Autoencoder Subspace Guided Projections |
Xu Wang et.al. |
2505.24428 |
null |
| 2025-05-30 |
From Hallucinations to Jailbreaks: Rethinking the Vulnerability of Large Foundation Models |
Haibo Jin et.al. |
2505.24232 |
null |
| 2025-05-30 |
SentinelAgent: Graph-based Anomaly Detection in Multi-Agent Systems |
Xu He et.al. |
2505.24201 |
null |
| 2025-05-29 |
LLM Agents Should Employ Security Principles |
Kaiyuan Zhang et.al. |
2505.24019 |
null |
| 2025-05-29 |
Securing AI Agents with Information-Flow Control |
Manuel Costa et.al. |
2505.23643 |
link |
| 2025-05-29 |
Understanding Refusal in Language Models with Sparse Autoencoders |
Wei Jie Yeo et.al. |
2505.23556 |
link |
| 2025-05-29 |
Adaptive Jailbreaking Strategies Based on the Semantic Understanding Capabilities of Large Language Models |
Mingyu Yu et.al. |
2505.23404 |
null |
| 2025-05-28 |
Operationalizing CaMeL: Strengthening LLM Defenses for Enterprise Deployment |
Krti Tallam et.al. |
2505.22852 |
null |
| 2025-05-28 |
Adaptive Detoxification: Safeguarding General Capabilities of LLMs through Toxicity-Aware Knowledge Editing |
Yifan Lu et.al. |
2505.22298 |
null |
| 2025-05-28 |
Test-Time Immunization: A Universal Defense Framework Against Jailbreaks for (Multimodal) Large Language Models |
Yongcan Yu et.al. |
2505.22271 |
null |
| 2025-05-28 |
Jailbreak Distillation: Renewable Safety Benchmarking |
Jingyu Zhang et.al. |
2505.22037 |
null |
| 2025-05-28 |
RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments |
Zeyi Liao et.al. |
2505.21936 |
link |
| 2025-05-27 |
Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation |
Tharindu Kumarage et.al. |
2505.21784 |
null |
| 2025-05-26 |
Benign-to-Toxic Jailbreaking: Inducing Harmful Responses from Harmless Prompts |
Hee-Seon Kim et.al. |
2505.21556 |
null |
| 2025-05-28 |
Breaking the Ceiling: Exploring the Potential of Jailbreak Attacks through Expanding Strategy Space |
Yao Huang et.al. |
2505.21277 |
link |
| 2025-05-27 |
Improved Representation Steering for Language Models |
Zhengxuan Wu et.al. |
2505.20809 |
link |
| 2025-05-26 |
Holes in Latent Space: Topological Signatures Under Adversarial Influence |
Aideen Fay et.al. |
2505.20435 |
null |
| 2025-05-26 |
Lifelong Safety Alignment for Language Models |
Haoyu Wang et.al. |
2505.20259 |
link |
| 2025-05-26 |
Capability-Based Scaling Laws for LLM Red-Teaming |
Alexander Panfilov et.al. |
2505.20162 |
link |
| 2025-05-26 |
Attention! You Vision Language Model Could Be Maliciously Manipulated |
Xiaosen Wang et.al. |
2505.19911 |
null |
| 2025-05-26 |
What Really Matters in Many-Shot Attacks? An Empirical Study of Long-Context Vulnerabilities in LLMs |
Sangyeop Kim et.al. |
2505.19773 |
null |
| 2025-05-26 |
SGM: A Framework for Building Specification-Guided Moderation Filters |
Masoomali Fatehkia et.al. |
2505.19766 |
null |
| 2025-05-26 |
VisCRA: A Visual Chain Reasoning Attack for Jailbreaking Multimodal Large Language Models |
Bingrui Sima et.al. |
2505.19684 |
null |
| 2025-05-26 |
JailBound: Jailbreaking Internal Safety Boundaries of Vision-Language Models |
Jiaxin Song et.al. |
2505.19610 |
null |
| 2025-05-25 |
GhostPrompt: Jailbreaking Text-to-image Generative Models based on Dynamic Optimization |
Zixuan Chen et.al. |
2505.18979 |
null |
| 2025-05-25 |
Stronger Enforcement of Instruction Hierarchy via Augmented Intermediate Representations |
Sanjay Kariyappa et.al. |
2505.18907 |
null |
| 2025-05-24 |
Security Concerns for Large Language Models: A Survey |
Miles Q. Li et.al. |
2505.18889 |
null |
| 2025-05-24 |
Audio Jailbreak Attacks: Exposing Vulnerabilities in SpeechGPT in a White-Box Framework |
Binhao Ma et.al. |
2505.18864 |
link |
| 2025-05-23 |
Survival Games: Human-LLM Strategic Showdowns under Severe Resource Scarcity |
Zhihong Chen et.al. |
2505.17937 |
link |
| 2025-05-23 |
Does Chain-of-Thought Reasoning Really Reduce Harmfulness from Jailbreaking? |
Chengda Lu et.al. |
2505.17650 |
null |
| 2025-05-23 |
Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models |
Jiawei Kong et.al. |
2505.17601 |
null |
| 2025-05-23 |
One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs |
Linbao Li et.al. |
2505.17598 |
link |
| 2025-05-23 |
JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models |
Zifan Peng et.al. |
2505.17568 |
link |
| 2025-05-23 |
Chain-of-Lure: A Synthetic Narrative-Driven Approach to Compromise Large Language Models |
Wenhan Chang et.al. |
2505.17519 |
null |
| 2025-05-22 |
Refusal Direction is Universal Across Safety-Aligned Languages |
Xinpeng Wang et.al. |
2505.17306 |
null |
| 2025-05-22 |
In-Context Watermarks for Large Language Models |
Yepeng Liu et.al. |
2505.16934 |
null |
| 2025-05-22 |
When Safety Detectors Aren't Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques |
Jianing Geng et.al. |
2505.16765 |
null |
| 2025-05-23 |
Finetuning-Activated Backdoors in LLMs |
Thibaud Gloaguen et.al. |
2505.16567 |
link |
| 2025-05-22 |
Implicit Jailbreak Attacks via Cross-Modal Information Concealment on Vision-Language Models |
Zhaoxin Wang et.al. |
2505.16446 |
null |
| 2025-05-22 |
Three Minds, One Legend: Jailbreak Large Reasoning Model with Adaptive Stacked Ciphers |
Viet-Anh Nguyen et.al. |
2505.16241 |
null |
| 2025-05-22 |
SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning |
Kaiwen Zhou et.al. |
2505.16186 |
null |
| 2025-05-21 |
Scalable Defense against In-the-wild Jailbreaking Attacks with Safety Context Retrieval |
Taiye Chen et.al. |
2505.15753 |
null |
| 2025-05-21 |
Alignment Under Pressure: The Case for Informed Adversaries When Evaluating LLM Defenses |
Xiaoxue Yang et.al. |
2505.15738 |
link |
| 2025-05-21 |
Silent Leaks: Implicit Knowledge Extraction Attack on RAG Systems through Benign Queries |
Yuhao Wang et.al. |
2505.15420 |
null |
| 2025-05-21 |
Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models |
Zirui Song et.al. |
2505.15406 |
link |
| 2025-05-20 |
SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment |
Wonje Jeung et.al. |
2505.14667 |
null |
| 2025-05-20 |
sudoLLM : On Multi-role Alignment of Language Models |
Soumadeep Saha et.al. |
2505.14607 |
null |
| 2025-05-20 |
Can Large Language Models Really Recognize Your Name? |
Dzung Pham et.al. |
2505.14549 |
link |
| 2025-05-20 |
Breaking Bad Tokens: Detoxification of LLMs Using Sparse Autoencoders |
Agam Goyal et.al. |
2505.14536 |
null |
| 2025-05-20 |
Lessons from Defending Gemini Against Indirect Prompt Injections |
Chongyang Shi et.al. |
2505.14534 |
null |
| 2025-05-20 |
Is Your Prompt Safe? Investigating Prompt Injection Attacks Against Open-Source LLMs |
Jiawen Wang et.al. |
2505.14368 |
null |
| 2025-05-20 |
Exploring Jailbreak Attacks on LLMs through Intent Concealment and Diversion |
Tiehan Cui et.al. |
2505.14316 |
null |
| 2025-05-20 |
EVA: Red-Teaming GUI Agents via Evolving Indirect Prompt Injection |
Yijie Lu et.al. |
2505.14289 |
null |
| 2025-05-20 |
"Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs |
Darpan Aswal et.al. |
2505.14226 |
null |
| 2025-05-20 |
AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models |
Guangke Chen et.al. |
2505.14103 |
null |
| 2025-05-19 |
Investigating the Vulnerability of LLM-as-a-Judge Architectures to Prompt-Injection Attacks |
Narek Maloyan et.al. |
2505.13348 |
null |
| 2025-05-19 |
I'll believe it when I see it: Images increase misinformation sharing in Vision-Language Models |
Alice Plebe et.al. |
2505.13302 |
link |
| 2025-05-19 |
The Hidden Dangers of Browsing AI Agents |
Mykyta Mudryi et.al. |
2505.13076 |
null |
| 2025-05-18 |
BadNAVer: Exploring Jailbreak Attacks On Vision-and-Language Navigation |
Wenqi Lyu et.al. |
2505.12443 |
null |
| 2025-05-18 |
CAPTURE: Context-Aware Prompt Injection Testing and Robustness Enhancement |
Gauri Kholkar et.al. |
2505.12368 |
null |
| 2025-05-18 |
The Tower of Babel Revisited: Multilingual Jailbreak Prompts on Closed-Source Large Language Models |
Linghan Huang et.al. |
2505.12287 |
null |
| 2025-05-17 |
Why Not Act on What You Know? Unleashing Safety Potential of LLMs via Self-Aware Guard Enhancement |
Peng Ding et.al. |
2505.12060 |
link |
| 2025-05-17 |
Multilingual Collaborative Defense for Large Language Models |
Hongliang Li et.al. |
2505.11835 |
link |
| 2025-05-17 |
JULI: Jailbreak Large Language Models by Self-Introspection |
Jesson Wang et.al. |
2505.11790 |
null |
| 2025-05-16 |
EnvInjection: Environmental Prompt Injection Attack to Multi-modal Web Agents |
Xilong Wang et.al. |
2505.11717 |
null |
| 2025-05-16 |
ProxyPrompt: Securing System Prompts against Prompt Extraction Attacks |
Zhixiong Zhuang et.al. |
2505.11459 |
null |
| 2025-05-16 |
CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs |
Sijia Chen et.al. |
2505.11413 |
null |
| 2025-05-16 |
AutoRAN: Weak-to-Strong Jailbreaking of Large Reasoning Models |
Jiacheng Liang et.al. |
2505.10846 |
link |
| 2025-05-16 |
LARGO: Latent Adversarial Reflection through Gradient Optimization for Jailbreaking LLMs |
Ran Li et.al. |
2505.10838 |
null |
| 2025-05-15 |
Dark LLMs: The Growing Threat of Unaligned AI Models |
Michael Fire et.al. |
2505.10066 |
null |
| 2025-05-15 |
Analysing Safety Risks in LLMs Fine-Tuned with Pseudo-Malicious Cyber Security Data |
Adel ElZemity et.al. |
2505.09974 |
null |
| 2025-05-16 |
PIG: Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization |
Yidan Wang et.al. |
2505.09921 |
link |
| 2025-05-14 |
Adversarial Attack on Large Language Models using Exponentiated Gradient Descent |
Sajib Biswas et.al. |
2505.09820 |
link |
| 2025-05-14 |
Adversarial Suffix Filtering: a Defense Pipeline for LLMs |
David Khachaturov et.al. |
2505.09602 |
null |
| 2025-05-11 |
TokenProber: Jailbreaking Text-to-image Models via Fine-grained Word Impact Analysis |
Longtian Wang et.al. |
2505.08804 |
null |
| 2025-05-13 |
A Large-Scale Empirical Analysis of Custom GPTs' Vulnerabilities in the OpenAI Ecosystem |
Sunday Oyinlola Ogundoyin et.al. |
2505.08148 |
link |
| 2025-05-12 |
Concept-Level Explainability for Auditing & Steering LLM Responses |
Kenza Amara et.al. |
2505.07610 |
link |
| 2025-05-12 |
One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models |
Haoran Gu et.al. |
2505.07167 |
null |
| 2025-05-10 |
Jailbreaking the Text-to-Video Generative Models |
Jiayang Liu et.al. |
2505.06679 |
null |
| 2025-05-10 |
Practical Reasoning Interruption Attacks on Reasoning Large Language Models |
Yu Cui et.al. |
2505.06643 |
null |
| 2025-05-10 |
Think in Safety: Unveiling and Mitigating Safety Alignment Collapse in Multimodal Large Reasoning Model |
Xinyue Lou et.al. |
2505.06538 |
link |
| 2025-05-10 |
System Prompt Poisoning: Persistent Attacks on Large Language Models Beyond User Injection |
Jiawei Guo et.al. |
2505.06493 |
null |
| 2025-05-08 |
Defending against Indirect Prompt Injection by Instruction Detection |
Tongyu Wen et.al. |
2505.06311 |
link |
| 2025-05-09 |
AgentXploit: End-to-End Redteaming of Black-Box AI Agents |
Zhun Wang et.al. |
2505.05849 |
null |
| 2025-05-12 |
LiteLMGuard: Seamless and Lightweight On-Device Prompt Filtering for Safeguarding Small Language Models against Quantization-induced Risks and Vulnerabilities |
Kalyan Nakka et.al. |
2505.05619 |
link |
| 2025-05-07 |
Red Teaming the Mind of the Machine: A Systematic Evaluation of Prompt Injection and Jailbreak Vulnerabilities in LLMs |
Chetan Pathade et.al. |
2505.04806 |
null |
| 2025-05-07 |
Safeguard-by-Development: A Privacy-Enhanced Development Paradigm for Multi-Agent Collaboration Systems |
Jian Cui et.al. |
2505.04799 |
null |
| 2025-05-07 |
A Proposal for Evaluating the Operational Risk for ChatBots based on Large Language Models |
Pedro Pinacho-Davidson et.al. |
2505.04784 |
null |
| 2025-05-07 |
The Aloe Family Recipe for Open and Specialized Healthcare LLMs |
Dario Garcia-Gasulla et.al. |
2505.04388 |
null |
| 2025-05-07 |
Unmasking the Canvas: A Dynamic Benchmark for Image Generation Jailbreaking and LLM Content Safety |
Variath Madhupal Gautham Nair et.al. |
2505.04146 |
null |
| 2025-05-06 |
LlamaFirewall: An open source guardrail system for building secure AI agents |
Sahana Chennabasappa et.al. |
2505.03574 |
null |
| 2025-05-03 |
Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs |
Haoming Yang et.al. |
2505.02862 |
null |
| 2025-05-04 |
Open Challenges in Multi-Agent Security: Towards Secure Systems of Interacting AI Agents |
Christian Schroeder de Witt et.al. |
2505.02077 |
null |
| 2025-05-05 |
Helping Large Language Models Protect Themselves: An Enhanced Filtering and Summarization System |
Sheikh Samit Muhaimin et.al. |
2505.01315 |
null |
| 2025-05-01 |
OET: Optimization-based prompt injection Evaluation Toolkit |
Jinsheng Pan et.al. |
2505.00843 |
link |
| 2025-05-05 |
The Illusion of Role Separation: Hidden Shortcuts in LLM Role Learning (and How to Fix Them) |
Zihao Wang et.al. |
2505.00626 |
null |
| 2025-04-29 |
HyPerAlign: Hypotheses-driven Personalized Alignment |
Cristina Garbacea et.al. |
2505.00038 |
null |
| 2025-04-30 |
XBreaking: Explainable Artificial Intelligence for Jailbreaking LLMs |
Marco Arazzi et.al. |
2504.21700 |
null |
| 2025-04-30 |
Hoist with His Own Petard: Inducing Guardrails to Facilitate Denial-of-Service Attacks on Retrieval-Augmented Generation of LLMs |
Pan Suo et.al. |
2504.21680 |
null |
| 2025-04-30 |
The Dual Power of Interpretable Token Embeddings: Jailbreaking Attacks and Defenses for Diffusion Model Unlearning |
Siyi Chen et.al. |
2504.21307 |
null |
| 2025-04-29 |
CachePrune: Neural-Based Attribution Defense Against Indirect Prompt Injection Attacks |
Rui Wang et.al. |
2504.21228 |
null |
| 2025-04-29 |
ACE: A Security Architecture for LLM-Integrated App Systems |
Evan Li et.al. |
2504.20984 |
null |
| 2025-04-29 |
AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security |
Zikui Cai et.al. |
2504.20965 |
link |
| 2025-04-29 |
Chain-of-Defensive-Thought: Structured Reasoning Elicits Robustness in Large Language Models against Reference Corruption |
Wenxiao Wang et.al. |
2504.20769 |
null |
| 2025-04-29 |
Token-Efficient Prompt Injection Attack: Provoking Cessation in LLM Reasoning via Adaptive Token Compression |
Yu Cui et.al. |
2504.20493 |
null |
| 2025-04-29 |
Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction |
Yulin Chen et.al. |
2504.20472 |
null |
| 2025-04-29 |
Inception: Jailbreak the Memory Mechanism of Text-to-Image Generation Systems |
Shiqian Zhao et.al. |
2504.20376 |
null |
| 2025-04-28 |
Prompt Injection Attack to Tool Selection in LLM Agents |
Jiawen Shi et.al. |
2504.19793 |
null |
| 2025-04-29 |
Security Steerability is All You Need |
Itay Hazan et.al. |
2504.19521 |
null |
| 2025-04-28 |
JailbreaksOverTime: Detecting Jailbreak Attacks Under Distribution Shift |
Julien Piet et.al. |
2504.19440 |
link |
| 2025-04-27 |
Small Models, Big Tasks: An Exploratory Empirical Study on Small Language Models for Function Calling |
Ishan Kavathekar et.al. |
2504.19277 |
link |
| 2025-04-26 |
Graph of Attacks: Improved Black-Box and Interpretable Jailbreaks for LLMs |
Mohammad Akbar-Tajari et.al. |
2504.19019 |
link |
| 2025-04-22 |
WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks |
Ivan Evtimov et.al. |
2504.18575 |
link |
| 2025-04-25 |
Adversarial Attacks on LLM-as-a-Judge Systems: Insights from Prompt Injections |
Narek Maloyan et.al. |
2504.18333 |
null |
| 2025-04-23 |
Amplified Vulnerabilities: Structured Jailbreak Attacks on LLM-based Multi-Agent Debate |
Senmao Qi et.al. |
2504.16489 |
null |
| 2025-04-20 |
Breaking the Prompt Wall (I): A Real-World Case Study of Attacking ChatGPT via Lightweight Prompt Injection |
Xiangyu Chang et.al. |
2504.16125 |
null |
| 2025-04-26 |
T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video Models |
Siyuan Liang et.al. |
2504.15512 |
null |
| 2025-04-21 |
MR. Guard: Multilingual Reasoning Guardrail using Curriculum Learning |
Yahan Yang et.al. |
2504.15241 |
null |
| 2025-04-20 |
Prompt-Hacking: The New p-Hacking? |
Thomas Kosch et.al. |
2504.14571 |
null |
| 2025-04-20 |
LLM-Enabled In-Context Learning for Data Collection Scheduling in UAV-assisted Sensor Networks |
Yousef Emami et.al. |
2504.14556 |
null |
| 2025-04-25 |
Manipulating Multimodal Agents via Cross-Modal Prompt Injection |
Le Wang et.al. |
2504.14348 |
null |
| 2025-04-18 |
DETAM: Defending LLMs Against Jailbreak Attacks via Targeted Attention Modification |
Yu Li et.al. |
2504.13562 |
null |
| 2025-04-15 |
X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents |
Salman Rahman et.al. |
2504.13203 |
null |
| 2025-04-15 |
Concept Enhancement Engineering: A Lightweight and Efficient Robust Defense Against Jailbreak Attacks in Embodied AI |
Jirui Yang et.al. |
2504.13201 |
null |
| 2025-04-17 |
GraphAttack: Exploiting Representational Blindspots in LLM Safety Mechanisms |
Sinan He et.al. |
2504.13052 |
null |
| 2025-04-17 |
ZeroSumEval: Scaling LLM Evaluation with Inter-Model Competition |
Haidar Khan et.al. |
2504.12562 |
link |
| 2025-04-14 |
You've Changed: Detecting Modification of Black-Box Large Language Models |
Alden Dima et.al. |
2504.12335 |
null |
| 2025-04-15 |
DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks |
Yupei Liu et.al. |
2504.11358 |
link |
| 2025-04-16 |
Bypassing Prompt Injection and Jailbreak Detection in LLM Guardrails |
William Hackett et.al. |
2504.11168 |
null |
| 2025-04-15 |
Token-Level Constraint Boundary Search for Jailbreaking Text-to-Image Models |
Jiangtao Liu et.al. |
2504.11106 |
null |
| 2025-04-14 |
The Jailbreak Tax: How Useful are Your Jailbreak Outputs? |
Kristina Nikolić et.al. |
2504.10694 |
link |
| 2025-04-14 |
Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding |
Tao Zhang et.al. |
2504.10465 |
link |
| 2025-04-16 |
LLM Unlearning Reveals a Stronger-Than-Expected Coreset Effect in Current Benchmarks |
Soumyadeep Pal et.al. |
2504.10185 |
link |
| 2025-04-14 |
RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability |
Yichi Zhang et.al. |
2504.10081 |
null |
| 2025-04-14 |
StruPhantom: Evolutionary Injection Attacks on Black-Box Tabular Agents Powered by Large Language Models |
Yang Feng et.al. |
2504.09841 |
null |
| 2025-04-13 |
The Structural Safety Generalization Problem |
Julius Broomfield et.al. |
2504.09712 |
link |
| 2025-04-13 |
Mitigating Many-Shot Jailbreaking |
Christopher M. Ackerman et.al. |
2504.09604 |
null |
| 2025-04-13 |
ControlNET: A Firewall for RAG-based LLM System |
Hongwei Yao et.al. |
2504.09593 |
null |
| 2025-04-13 |
AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender |
Weixiang Zhao et.al. |
2504.09466 |
null |
| 2025-04-13 |
SaRO: Enhancing LLM Safety through Reasoning-based Alignment |
Yutao Mou et.al. |
2504.09420 |
null |
| 2025-04-12 |
Feature-Aware Malicious Output Detection and Mitigation |
Weilong Dong et.al. |
2504.09191 |
null |
| 2025-04-10 |
Geneshift: Impact of different scenario shift on Jailbreaking LLM |
Tianyi Wu et.al. |
2504.08104 |
null |
| 2025-04-10 |
Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge |
Riccardo Cantini et.al. |
2504.07887 |
link |
| 2025-04-10 |
Defense against Prompt Injection Attacks via Mixture of Encodings |
Ruiyi Zhang et.al. |
2504.07467 |
link |
| 2025-04-09 |
Bypassing Safety Guardrails in LLMs Using Humor |
Pedro Cisneros-Velarde et.al. |
2504.06577 |
null |
| 2025-04-08 |
Mind the Trojan Horse: Image Prompt Adapter Enabling Scalable and Deceptive Jailbreaking |
Junxi Chen et.al. |
2504.05838 |
link |
| 2025-04-08 |
Separator Injection Attack: Uncovering Dialogue Biases in Large Language Models Caused by Role Separators |
Xitao Li et.al. |
2504.05689 |
null |
| 2025-04-08 |
Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking |
Yu-Hang Wu et.al. |
2504.05652 |
link |
| 2025-04-07 |
A Domain-Based Taxonomy of Jailbreak Vulnerabilities in Large Language Models |
Carlos Peláez-González et.al. |
2504.04976 |
null |
| 2025-04-08 |
Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models |
Yubo Li et.al. |
2504.04717 |
link |
| 2025-04-06 |
StyleRec: A Benchmark Dataset for Prompt Recovery in Writing Style Transformation |
Shenyang Liu et.al. |
2504.04373 |
null |
| 2025-04-08 |
JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model |
Yi Nian et.al. |
2504.03770 |
link |
| 2025-04-03 |
More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment |
Yifan Wang et.al. |
2504.02193 |
null |
| 2025-04-02 |
Evolving Security in LLMs: A Study of Jailbreak Attacks and Defenses |
Zhengchun Shang et.al. |
2504.02080 |
null |
| 2025-04-02 |
Representation Bending for Large Language Model Safety |
Ashkan Yousefpour et.al. |
2504.01550 |
link |
| 2025-04-02 |
LightDefense: A Lightweight Uncertainty-Driven Defense against Jailbreaks via Shifted Token Distribution |
Zhuoran Yang et.al. |
2504.01533 |
null |
| 2025-04-07 |
PiCo: Jailbreaking Multimodal Large Language Models via $\textbf{Pi}$ctorial $\textbf{Co}$de Contextualization |
Aofan Liu et.al. |
2504.01444 |
null |
| 2025-04-07 |
Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks |
Jiawei Wang et.al. |
2504.01308 |
link |
| 2025-04-02 |
Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning |
Si Chen et.al. |
2504.01278 |
null |
| 2025-04-01 |
Multilingual and Multi-Accent Jailbreaking of Audio LLMs |
Jaechul Roh et.al. |
2504.01094 |
null |
| 2025-04-01 |
Exposing the Ghost in the Transformer: Abnormal Detection for Large Language Models via Hidden State Forensics |
Shide Zhou et.al. |
2504.00446 |
null |
| 2025-03-31 |
Output Constraints as Attack Surface: Exploiting Structured Generation to Bypass LLM Safety Mechanisms |
Shuoming Zhang et.al. |
2503.24191 |
null |
| 2025-03-29 |
Encrypted Prompt: Securing LLM Applications Against Unauthorized Actions |
Shih-Han Chan et.al. |
2503.23250 |
null |
| 2025-03-27 |
Prompt, Divide, and Conquer: Bypassing Large Language Model Safety Filters via Segmented and Distributed Prompt Processing |
Johan Wahréus et.al. |
2503.21598 |
null |
| 2025-03-27 |
Harnessing Chain-of-Thought Metadata for Task Routing and Adversarial Prompt Detection |
Ryan Marinelli et.al. |
2503.21464 |
link |
| 2025-03-26 |
Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy |
Joonhyun Jeong et.al. |
2503.20823 |
link |
| 2025-03-26 |
Iterative Prompting with Persuasion Skills in Jailbreaking Large Language Models |
Shih-Wen Ke et.al. |
2503.20320 |
null |
| 2025-03-26 |
sudo rm -rf agentic_security |
Sejin Lee et.al. |
2503.20279 |
link |
| 2025-03-24 |
MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks |
Wenhao You et.al. |
2503.19134 |
null |
| 2025-03-23 |
SRMIR: Shadow Reward Models Based on Introspective Reasoning for LLM Alignment |
Ruoxi Cheng et.al. |
2503.18991 |
null |
| 2025-03-24 |
Defeating Prompt Injections by Design |
Edoardo Debenedetti et.al. |
2503.18813 |
null |
| 2025-03-23 |
Metaphor-based Jailbreaking Attacks on Text-to-Image Models |
Chenyu Zhang et.al. |
2503.17987 |
null |
| 2025-03-23 |
Smoke and Mirrors: Jailbreaking LLM-based Code Generation via Implicit Malicious Prompts |
Sheng Ouyang et.al. |
2503.17953 |
null |