tyskill/llm-arxiv-daily: 🎓 Automatically Update LLM Security Papers Daily using GitHub Actions

Updated on 2026.05.04

Usage instructions: here

Table of Contents

Model Security
Prompt Injection
Code Embedding
Model Context Protocol
Supply Chain Attacks

Model Security

Publish Date	Title	Authors	PDF	Code
2025-07-23	HySafe-AI: Hybrid Safety Architectural Analysis Framework for AI Systems: A Case Study	Mandar Pitale et.al.	2507.17118	null
2025-07-22	Towards Trustworthy AI: Secure Deepfake Detection using CNNs and Zero-Knowledge Proofs	H M Mohaimanul Islam et.al.	2507.17010	null
2025-07-22	Depth Gives a False Sense of Privacy: LLM Internal States Inversion	Tian Dong et.al.	2507.16372	null
2025-07-19	Combining Cost-Constrained Runtime Monitors for AI Safety	Tim Tian Hua et.al.	2507.15886	null
2025-07-19	When Autonomy Goes Rogue: Preparing for Risks of Multi-Agent Collusion in Social Systems	Qibing Ren et.al.	2507.14660	null
2025-07-22	Mapping the Parasocial AI Market: User Trends, Engagement and Risks	Zilan Qian et.al.	2507.14226	null
2025-07-15	Mitigating Trojanized Prompt Chains in Educational LLM Use Cases: Experimental Findings and Detection Tool Design	Richard M. Charles et.al.	2507.14207	null
2025-07-23	Fake or Real: The Impostor Hunt in Texts for Space Operations	Agata Kaczmarek et.al.	2507.13508	null
2025-07-17	Manipulation Attacks by Misaligned AI: Risk Analysis and Safety Case Framework	Rishane Dassanayake et.al.	2507.12872	null
2025-07-16	LLMs Encode Harmfulness and Refusal Separately	Jiachen Zhao et.al.	2507.11878	null
2025-07-09	The AI Shadow War: SaaS vs. Edge Computing Architectures	Rhea Pritham Marpu et.al.	2507.11545	null
2025-07-15	Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety	Tomek Korbak et.al.	2507.11473	null
2025-07-14	3S-Attack: Spatial, Spectral and Semantic Invisible Backdoor Attack Against DNN Models	Jianyao Yin et.al.	2507.10733	null
2025-07-16	From Semantic Web and MAS to Agentic AI: A Unified Narrative of the Web of Agents	Tatiana Petrova et.al.	2507.10644	null
2025-07-14	Can You Detect the Difference?	İsmail Tarım et.al.	2507.10475	null
2025-07-14	BlueGlass: A Framework for Composite AI Safety	Harshal Nandigramwar et.al.	2507.10106	null
2025-07-13	Measuring What Matters: A Framework for Evaluating Safety Risks in Real-World LLM Applications	Jia Yi Goh et.al.	2507.09820	null
2025-07-12	Adversarial Activation Patching: A Framework for Detecting and Mitigating Emergent Deception in Safety-Aligned Transformers	Santhosh Kumar Ravindran et.al.	2507.09406	null
2025-07-06	Mass-Scale Analysis of In-the-Wild Conversations Reveals Complexity Bounds on LLM Jailbreaking	Aldan Creo et.al.	2507.08014	null
2025-07-15	Secure Cooperative Gradient Coding: Optimality, Reliability, and Global Privacy	Shudi Weng et.al.	2507.07565	null
2025-07-09	Foundation Model Self-Play: Open-Ended Strategy Innovation via Foundation Models	Aaron Dharna et.al.	2507.06466	null
2025-07-08	Humans overrely on overconfident language models, across languages	Neil Rathi et.al.	2507.06306	null
2025-07-07	Evaluating the Critical Risks of Amazon's Nova Premier under the Frontier Model Safety Framework	Satyapriya Krishna et.al.	2507.06260	null
2025-07-08	CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations	Xiaohu Li et.al.	2507.06043	null
2025-07-08	Domain adaptation of large language models for geotechnical applications	Lei Fan et.al.	2507.05613	null
2025-07-07	When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors	Scott Emmons et.al.	2507.05246	null
2025-07-07	Trojan Horse Prompting: Jailbreaking Conversational Multimodal Models by Forging Assistant Message	Wei Duan et.al.	2507.04673	null
2025-07-03	From Turing to Tomorrow: The UK's Approach to AI Regulation	Oliver Ritchie et.al.	2507.03050	null
2025-07-01	'For Argument's Sake, Show Me How to Harm Myself!': Jailbreaking LLMs in Suicide and Self-Harm Contexts	Annika M Schoene et.al.	2507.02990	null
2025-07-01	GAF-Guard: An Agentic Framework for Risk Management and Governance in Large Language Models	Seshu Tirupathi et.al.	2507.02986	null
2025-07-03	Moral Responsibility or Obedience: What Do We Want from AI?	Joseph Boland et.al.	2507.02788	null
2025-07-03	Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks	Sizhe Chen et.al.	2507.02735	null
2025-07-02	How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks	Rahul Ramachandran et.al.	2507.01955	null
2025-07-02	Out-of-Distribution Detection Methods Answer the Wrong Questions	Yucen Lily Li et.al.	2507.01831	null
2025-07-01	SAFER: Probing Safety in Reward Models with Sparse Autoencoder	Sihang Li et.al.	2507.00665	null
2025-06-30	Thinking About Thinking: SAGE-nano's Inverse Reasoning for Self-Aware Language Models	Basab Jha et.al.	2507.00092	null
2025-06-30	Attestable Audits: Verifiable AI Safety Benchmarks Using Trusted Execution Environments	Christoph Schnabl et.al.	2506.23706	null
2025-06-30	A New Perspective On AI Safety Through Control Theory Methodologies	Lars Ullrich et.al.	2506.23703	null
2025-06-29	Securing AI Systems: A Guide to Known Attacks and Impacts	Naoto Kiribuchi et.al.	2506.23296	null
2025-06-28	MPC in the Quantum Head (or: Superposition-Secure (Quantum) Zero-Knowledge)	Andrea Coladangelo et.al.	2506.22961	null
2025-06-25	Mitigating Gambling-Like Risk-Taking Behaviors in Large Language Models: A Behavioral Economics Approach to AI Safety	Y. Du et.al.	2506.22496	null
2025-06-24	Report on NSF Workshop on Science of Safe AI	Rajeev Alur et.al.	2506.22492	null
2025-06-27	A Different Approach to AI Safety: Proceedings from the Columbia Convening on Openness in Artificial Intelligence and AI Safety	Camille François et.al.	2506.22183	null
2025-06-27	SODA: Out-of-Distribution Detection in Domain-Shifted Point Clouds via Neighborhood Propagation	Adam Goodge et.al.	2506.21892	null
2025-06-30	The Singapore Consensus on Global AI Safety Research Priorities	Yoshua Bengio et.al.	2506.20702	null
2025-06-25	Probing AI Safety with Source Code	Ujwal Narayan et.al.	2506.20471	null
2025-06-24	Persona Features Control Emergent Misalignment	Miles Wang et.al.	2506.19823	null
2025-06-21	AI Safety vs. AI Security: Demystifying the Distinction and Boundaries	Zhiqiang Lin et.al.	2506.18932	null
2025-06-23	How Robust is Model Editing after Fine-Tuning? An Empirical Study on Text-to-Image Diffusion Models	Feng He et.al.	2506.18428	null
2025-06-23	LLM-Integrated Digital Twins for Hierarchical Resource Allocation in 6G Networks	Majumder Haider et.al.	2506.18293	null
2025-06-22	AI Through the Human Lens: Investigating Cognitive Theories in Machine Psychology	Akash Kundu et.al.	2506.18156	null
2025-06-22	$φ^{\infty}$ : Clause Purification, Embedding Realignment, and the Total Suppression of the Em Dash in Autoregressive Language Models	Bugra Kilictas et.al.	2506.18129	null
2025-06-21	Out of Control -- Why Alignment Needs Formal Control Theory (and an Alignment Control Stack)	Elija Perrier et.al.	2506.17846	null
2025-06-20	SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification	Zhenglin Lai et.al.	2506.17368	null
2025-06-19	PL-Guard: Benchmarking Language Model Safety for Polish	Aleksandra Krasnodębska et.al.	2506.16322	null
2025-06-19	Probing the Robustness of Large Language Models Safety to Latent Perturbations	Tianle Gu et.al.	2506.16078	link
2025-06-18	LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning	Gabrel J. Perin et.al.	2506.15606	link
2025-06-17	TriGuard: Testing Model Safety with Attribution Entropy, Verification, and Drift	Dipesh Tharu Mahato et.al.	2506.14217	link
2025-06-17	The Ethics of Generative AI in Anonymous Spaces: A Case Study of 4chan's /pol/ Board	Parth Gaba et.al.	2506.14191	null
2025-06-17	Safe-Child-LLM: A Developmental Benchmark for Evaluating LLM Safety in Child-LLM Interactions	Junfeng Jiao et.al.	2506.13510	link
2025-06-16	Position: Certified Robustness Does Not (Yet) Imply Model Security	Andrew C. Cullen et.al.	2506.13024	null
2025-06-15	Intriguing Frequency Interpretation of Adversarial Robustness for CNNs and ViTs	Lu Chen et.al.	2506.12875	null
2025-06-14	OpenUnlearning: Accelerating LLM Unlearning via Unified Benchmarking of Methods and Metrics	Vineeth Dorna et.al.	2506.12618	link
2025-06-14	Tiered Agentic Oversight: A Hierarchical Multi-Agent System for AI Safety in Healthcare	Yubin Kim et.al.	2506.12482	null
2025-06-13	InfoFlood: Jailbreaking Large Language Models with Information Overload	Advait Yadav et.al.	2506.12274	null
2025-06-13	Hatevolution: What Static Benchmarks Don't Tell Us	Chiara Di Bonaventura et.al.	2506.12148	null
2025-06-13	Improving Large Language Model Safety with Contrastive Representation Learning	Samuel Simko et.al.	2506.11938	link
2025-06-13	Model Organisms for Emergent Misalignment	Edward Turner et.al.	2506.11613	null
2025-06-12	The Alignment Trap: Complexity Barriers	Jasper Yao et.al.	2506.10304	null
2025-06-11	Data-Centric Safety and Ethical Measures for Data and AI Governance	Srija Chakraborty et.al.	2506.10217	null
2025-06-09	LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges	Haoyang Li et.al.	2506.10022	link
2025-06-08	Enhancing the Safety of Medical Vision-Language Models by Synthetic Demonstrations	Zhiyu Xue et.al.	2506.09067	null
2025-06-11	Societal AI Research Has Become Less Interdisciplinary	Dror Kris Markus et.al.	2506.08738	null
2025-06-11	AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin	Shuo Yang et.al.	2506.08473	link
2025-06-06	Benchmarking Misuse Mitigation Against Covert Adversaries	Davis Brown et.al.	2506.06414	link
2025-06-03	Rational Superautotrophic Diplomacy (SupraAD); A Conceptual Framework for Alignment Based on Interdisciplinary Findings on the Fundamentals of Cognition	Andrea Morris et.al.	2506.05389	null
2025-06-05	Normative Conflicts and Shallow AI Alignment	Raphaël Millière et.al.	2506.04679	null
2025-06-04	Watermarking Degrades Alignment in Language Models: Analysis and Mitigation	Apurv Verma et.al.	2506.04462	link
2025-06-04	Misalignment or misuse? The AGI alignment tradeoff	Max Hellrigel-Holderbaum et.al.	2506.03755	null
2025-06-04	Bridging the Artificial Intelligence Governance Gap: The United States' and China's Divergent Approaches to Governing General-Purpose Artificial Intelligence	Oliver Guest et.al.	2506.03497	null
2025-06-03	MAEBE: Multi-Agent Emergent Behavior Framework	Sinem Erisken et.al.	2506.03053	null
2025-06-02	Trojan Horse Hunt in Time Series Forecasting for Space Operations	Krzysztof Kotowski et.al.	2506.01849	null
2025-06-02	ReGA: Representation-Guided Abstraction for Model-based Safeguarding of LLMs	Zeming Wei et.al.	2506.01770	link
2025-06-02	Silence is Golden: Leveraging Adversarial Examples to Nullify Audio Control in LDM-based Talking-Head Generation	Yuan Gan et.al.	2506.01591	link
2025-05-31	Wide Reflective Equilibrium in LLM Alignment: Bridging Moral Epistemology and AI Safety	Matthew Brophy et.al.	2506.00415	null
2025-05-30	Let Them Down Easy! Contextual Effects of LLM Guardrails on User Perceptions and Preferences	Mingqian Zheng et.al.	2506.00195	null
2025-05-30	Disentangled Safety Adapters Enable Efficient Guardrails and Flexible Inference-Time Alignment	Kundan Krishna et.al.	2506.00166	null
2025-05-30	TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis	Xiaorui Wu et.al.	2505.24672	link
2025-05-30	Benchmarking Large Language Models for Cryptanalysis and Mismatched-Generalization	Utsav Maskey et.al.	2505.24621	null
2025-05-30	The State of Multilingual LLM Safety Research: From Measuring the Language Gap to Mitigating It	Zheng-Xin Yong et.al.	2505.24119	null
2025-05-29	OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Modalities	Sahil Verma et.al.	2505.23856	link
2025-05-27	Watermarking Without Standards Is Not AI Governance	Alexander Nemecek et.al.	2505.23814	null
2025-05-29	SafeScientist: Toward Risk-Aware Scientific Discoveries by LLM Agents	Kunlun Zhu et.al.	2505.23559	link
2025-05-29	Adaptive Jailbreaking Strategies Based on the Semantic Understanding Capabilities of Large Language Models	Mingyu Yu et.al.	2505.23404	null
2025-05-28	Bridging Distribution Shift and AI Safety: Conceptual and Methodological Synergies	Chenruo Liu et.al.	2505.22829	null
2025-05-28	TensorShield: Safeguarding On-Device Inference by Shielding Critical DNN Tensors with TEE	Tong Sun et.al.	2505.22735	link
2025-05-27	Expert Survey: AI Reliability & Security Research Priorities	Joe O'Brien et.al.	2505.21664	null
2025-05-27	Preventing Adversarial AI Attacks Against Autonomous Situational Awareness: A Maritime Case Study	Mathew J. Walter et.al.	2505.21609	null
2025-05-27	SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge	Fengqing Jiang et.al.	2505.21605	null
2025-05-26	Benign-to-Toxic Jailbreaking: Inducing Harmful Responses from Harmless Prompts	Hee-Seon Kim et.al.	2505.21556	null
2025-05-27	The Multilingual Divide and Its Impact on Global AI Safety	Aidan Peppin et.al.	2505.21344	null
2025-05-27	Red-Teaming Text-to-Image Systems by Rule-based Preference Modeling	Yichuan Cao et.al.	2505.21074	null
2025-05-26	VSCBench: Bridging the Gap in Vision-Language Model Safety Calibration	Jiahui Geng et.al.	2505.20362	link
2025-05-26	What Really Matters in Many-Shot Attacks? An Empirical Study of Long-Context Vulnerabilities in LLMs	Sangyeop Kim et.al.	2505.19773	null
2025-05-25	When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas	Steffen Backmann et.al.	2505.19212	link
2025-05-25	GhostPrompt: Jailbreaking Text-to-image Generative Models based on Dynamic Optimization	Zixuan Chen et.al.	2505.18979	null
2025-05-24	Guided by Guardrails: Control Barrier Functions as Safety Instructors for Robotic Learning	Maeva Guerrier et.al.	2505.18858	null
2025-05-24	Safety Alignment via Constrained Knowledge Unlearning	Zesheng Shi et.al.	2505.18588	null
2025-05-23	Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary	Licheng Pan et.al.	2505.18325	null
2025-05-23	Surfacing Semantic Orthogonality Across Model Safety Benchmarks: A Multi-Dimensional Analysis	Jonathan Bennion et.al.	2505.17636	null
2025-05-23	Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models	Jiawei Kong et.al.	2505.17601	null
2025-05-20	From nuclear safety to LLM security: Applying non-probabilistic risk management strategies to build safe and secure LLM-powered systems	Alexander Gutfraind et.al.	2505.17084	null
2025-05-22	When Safety Detectors Aren't Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques	Jianing Geng et.al.	2505.16765	null
2025-05-22	Mitigating Fine-tuning Risks in LLMs via Safety-Aware Probing Optimization	Chengcan Wu et.al.	2505.16737	link
2025-05-21	Improving LLM First-Token Predictions in Multiple-Choice Question Answering via Prefilling Attack	Silvia Cappelletti et.al.	2505.15323	null
2025-05-20	Foundations of Unknown-aware Machine Learning	Xuefeng Du et.al.	2505.14933	null
2025-05-20	Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas	Yu Ying Chiu et.al.	2505.14633	link
2025-05-19	Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations	Li Ji-An et.al.	2505.13763	null
2025-05-16	Noise Injection Systemically Degrades Large Language Model Safety Guardrails	Prithviraj Singh Shahani et.al.	2505.13500	null
2025-05-19	Adversarial Testing in LLMs: Insights into Decision-Making Vulnerabilities	Lili Zhang et.al.	2505.13195	null
2025-05-19	Bullying the Machine: How Personas Increase LLM Vulnerability	Ziwei Xu et.al.	2505.12692	null
2025-05-18	Persuasion and Safety in the Era of Generative AI	Haein Kong et.al.	2505.12248	null
2025-05-17	Position Paper: Bounded Alignment: What (Not) To Expect From AGI Agents	Ali A. Minai et.al.	2505.11866	null
2025-05-16	Probing the Vulnerability of Large Language Models to Polysemantic Interventions	Bofan Gong et.al.	2505.11611	null
2025-05-16	Illusion or Algorithm? Investigating Memorization, Emergence, and Symbolic Processing in In-Context Learning	Jingcheng Niu et.al.	2505.11004	link
2025-05-15	Formalising Human-in-the-Loop: Computational Reductions, Failure Modes, and Legal-Moral Responsibility	Maurice Chiodo et.al.	2505.10426	null
2025-05-15	Dark LLMs: The Growing Threat of Unaligned AI Models	Michael Fire et.al.	2505.10066	null
2025-05-15	Analysing Safety Risks in LLMs Fine-Tuned with Pseudo-Malicious Cyber Security Data	Adel ElZemity et.al.	2505.09974	null
2025-05-14	Access Controls Will Solve the Dual-Use Dilemma	Evžen Wybitul et.al.	2505.09341	null
2025-05-16	SecReEvalBench: A Multi-turned Security Resilience Evaluation Benchmark for Large Language Models	Huining Cui et.al.	2505.07584	null
2025-05-09	Offensive Security for AI Systems: Concepts, Practices, and Applications	Josh Harguess et.al.	2505.06380	null
2025-05-08	Safety by Measurement: A Systematic Literature Review of AI Safety Evaluation Methods	Markov Grey et.al.	2505.05541	null
2025-05-08	Reasoning Models Don't Always Say What They Think	Yanda Chen et.al.	2505.05410	null
2025-05-08	Advancing Neural Network Verification through Hierarchical Safety Abstract Interpretation	Luca Marzari et.al.	2505.05235	null
2025-05-08	Belief Filtering for Epistemic Control in Linguistic State Space	Sebastian Dumbrava et.al.	2505.04927	null
2025-05-07	The Aloe Family Recipe for Open and Specialized Healthcare LLMs	Dario Garcia-Gasulla et.al.	2505.04388	null
2025-05-07	Unmasking the Canvas: A Dynamic Benchmark for Image Generation Jailbreaking and LLM Content Safety	Variath Madhupal Gautham Nair et.al.	2505.04146	null
2025-05-08	An alignment safety case sketch based on debate	Marie Davidsen Buhl et.al.	2505.03989	null
2025-05-05	What Is AI Safety? What Do We Want It to Be?	Jacqueline Harding et.al.	2505.02313	null
2025-05-04	Open Challenges in Multi-Agent Security: Towards Secure Systems of Interacting AI Agents	Christian Schroeder de Witt et.al.	2505.02077	null
2025-05-03	Third-party compliance reviews for frontier AI safety frameworks	Aidan Homewood et.al.	2505.01643	null
2025-05-02	Securing the Future of IVR: AI-Driven Innovation with Agile Security, Data Regulation, and Ethical AI Integration	Khushbu Mehboob Shaikh et.al.	2505.01514	null
2025-04-30	A Domain-Agnostic Scalable AI Safety Ensuring Framework	Beomjun Kim et.al.	2504.20924	null
2025-04-29	When Testing AI Tests Us: Safeguarding Mental Health on the Digital Frontlines	Sachin R. Pendse et.al.	2504.20910	null
2025-04-25	AI Awareness	Xiaojian Li et.al.	2504.20084	null
2025-04-28	Mitigating Societal Cognitive Overload in the Age of AI: Challenges and Directions	Salem Lahlou et.al.	2504.19990	null
2025-05-02	Securing Agentic AI: A Comprehensive Threat Model and Mitigation Framework for Generative AI Agents	Vineeth Sai Narajala et.al.	2504.19956	null
2025-04-28	AI Alignment in Medical Imaging: Unveiling Hidden Biases Through Counterfactual Analysis	Haroui Ma et.al.	2504.19621	link
2025-04-26	Latent Adversarial Training Improves the Representation of Refusal	Alexandra Abbas et.al.	2504.18872	null
2025-04-25	AI Safety Assurance for Automated Vehicles: A Survey on Research, Standardization, Regulation	Lars Ullrich et.al.	2504.18328	null
2025-04-25	RAG LLMs are Not Safer: A Safety Analysis of Retrieval-Augmented Generation for Large Language Models	Bang An et.al.	2504.18041	null
2025-04-17	Security-First AI: Foundations for Robust and Trustworthy Systems	Krti Tallam et.al.	2504.16110	null
2025-04-21	Safety Co-Option and Compromised National Security: The Self-Fulfilling Prophecy of Weakened AI Risk Thresholds	Heidy Khlaaf et.al.	2504.15088	null
2025-04-20	A Byzantine Fault Tolerance Approach towards AI Safety	John deVadoss et.al.	2504.14668	null
2025-04-20	Seeing Through Risk: A Symbolic Approximation of Prospect Theory	Ali Arslan Yousaf et.al.	2504.14448	null
2025-04-16	AI Safety Should Prioritize the Future of Work	Sanchaita Hazra et.al.	2504.13959	null
2025-04-17	In Which Areas of Technical AI Safety Could Geopolitical Rivals Cooperate?	Ben Bucknall et.al.	2504.12914	null
2025-04-16	Secure Transfer Learning: Training Clean Models Against Backdoor in (Both) Pre-trained Encoders and Downstream Datasets	Yechao Zhang et.al.	2504.11990	null
2025-04-14	The Jailbreak Tax: How Useful are Your Jailbreak Outputs?	Kristina Nikolić et.al.	2504.10694	link
2025-04-14	Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models?	Yanbo Wang et.al.	2504.10000	null
2025-04-13	The Structural Safety Generalization Problem	Julius Broomfield et.al.	2504.09712	link
2025-04-13	Mitigating Many-Shot Jailbreaking	Christopher M. Ackerman et.al.	2504.09604	null
2025-04-10	Geneshift: Impact of different scenario shift on Jailbreaking LLM	Tianyi Wu et.al.	2504.08104	null
2025-04-10	The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search	Yutaro Yamada et.al.	2504.08066	link
2025-04-10	Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge	Riccardo Cantini et.al.	2504.07887	link
2025-04-07	Following the Whispers of Values: Unraveling Neural Mechanisms Behind Value-Oriented Behaviors in LLMs	Ling Hu et.al.	2504.04994	null
2025-04-05	Towards Understanding and Improving Refusal in Compressed Models via Mechanistic Interpretability	Vishnu Kabir Chhabra et.al.	2504.04215	null
2025-04-05	Among Us: A Sandbox for Agentic Deception	Satvik Golechha et.al.	2504.04072	null
2025-04-03	Improving Harmful Text Detection with Joint Retrieval and External Knowledge	Zidong Yu et.al.	2504.02310	null
2025-04-02	Reinsuring AI: Energy, Agriculture, Finance & Medicine as Precedents for Scalable Governance of Frontier Artificial Intelligence	Nicholas Stetler et.al.	2504.02127	null
2025-03-28	A Framework for Cryptographic Verifiability of End-to-End AI Pipelines	Kar Balan et.al.	2503.22573	null
2025-03-28	Effective Automation to Support the Human Infrastructure in AI Red Teaming	Alice Qian Zhang et.al.	2503.22116	null
2025-03-28	Beyond Single-Sentence Prompts: Upgrading Value Alignment Benchmarks with Dialogues and Stories	Yazhou Zhang et.al.	2503.22115	null
2025-03-31	MAD Chairs: A new tool to evaluate AI	Chris Santos-Lang et.al.	2503.20986	null
2025-03-26	The Backfiring Effect of Weak AI Safety Regulation	Benjamin Laufer et.al.	2503.20848	null
2025-03-26	AI Safety in the Eyes of the Downstream Developer: A First Look at Concerns, Practices, and Challenges	Haoyu Gao et.al.	2503.19444	null
2025-03-18	International Agreements on AI Safety: Review and Recommendations for a Conditional AI Safety Treaty	Rebecca Scholefield et.al.	2503.18956	null
2025-03-22	Intelligence Sequencing and the Path-Dependence of Intelligence Evolution: AGI-First vs. DCI-First as Irreversible Attractors	Andy E. Williams et.al.	2503.17688	null
2025-03-17	AI Companies Should Report Pre- and Post-Mitigation Safety Evaluations	Dillon Bowen et.al.	2503.17388	null
2025-03-18	Temporal Context Awareness: A Defense Framework Against Multi-turn Manipulation Attacks on Large Language Models	Prashant Kulkarni et.al.	2503.15560	link
2025-03-19	A Peek Behind the Curtain: Using Step-Around Prompt Engineering to Identify Bias and Misinformation in GenAI Models	Don Hickerson et.al.	2503.15205	null
2025-03-17	ProDiF: Protecting Domain-Invariant Features to Secure Pre-Trained Models Against Extraction	Tong Zhou et.al.	2503.13224	null
2025-03-17	Identifying Cooperative Personalities in Multi-agent Contexts through Personality Steering with Representation Engineering	Kenneth J. K. Ong et.al.	2503.12722	null

Prompt Injection

Publish Date	Title	Authors	PDF	Code
2025-07-21	Multi-Stage Prompt Inference Attacks on Enterprise LLM Systems	Andrii Balashov et.al.	2507.15613	null
2025-07-21	QSAF: A Novel Mitigation Framework for Cognitive Degradation in Agentic AI	Hammad Atta et.al.	2507.15330	null
2025-07-21	PromptArmor: Simple yet Effective Prompt Injection Defenses	Tianneng Shi et.al.	2507.15219	null
2025-07-20	DeRAG: Black-box Adversarial Attacks on Multiple Retrieval-Augmented Generation Applications via Prompt Injection	Jerry Wang et.al.	2507.15042	null
2025-07-20	AlphaAlign: Incentivizing Safety Alignment with Extremely Simplified Reinforcement Learning	Yi Zhang et.al.	2507.14987	null
2025-07-20	Hierarchical Cross-modal Prompt Learning for Vision-Language Models	Hao Zheng et.al.	2507.14976	null
2025-07-20	Strategic Integration of AI Chatbots in Physics Teacher Preparation: A TPACK-SWOT Analysis of Pedagogical, Epistemic, and Cybersecurity Dimensions	N. Mohammadipour et.al.	2507.14860	null
2025-07-20	Manipulating LLM Web Agents with Indirect Prompt Injection Attack via HTML Accessibility Tree	Sam Johnson et.al.	2507.14799	null
2025-07-18	Innocence in the Crossfire: Roles of Skip Connections in Jailbreaking Visual Language Models	Palash Nandi et.al.	2507.13761	null
2025-07-18	TopicAttack: An Indirect Prompt Injection Attack via Topic Transition	Yulin Chen et.al.	2507.13686	null
2025-07-17	Paper Summary Attack: Jailbreaking LLMs through LLM Safety Papers	Liang Lin et.al.	2507.13474	null
2025-07-17	Prompt Injection 2.0: Hybrid AI Threats	Jeremy McHugh et.al.	2507.13169	null
2025-07-17	MAD-Spear: A Conformity-Driven Prompt Injection Attack on Multi-Agent Debate Systems	Yu Cui et.al.	2507.13038	null
2025-07-16	Exploiting Jailbreaking Vulnerabilities in Generative AI to Bypass Ethical Safeguards for Facilitating Phishing Attacks	Rina Mishra et.al.	2507.12185	null
2025-07-16	LLMs Encode Harmfulness and Refusal Separately	Jiachen Zhao et.al.	2507.11878	null
2025-07-15	Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility	Brendan Murphy et.al.	2507.11630	null
2025-07-14	ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning	Zhengyue Zhao et.al.	2507.11500	null
2025-07-15	The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs	Zichen Wen et.al.	2507.11097	null
2025-07-17	SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems	Wenliang Shan et.al.	2507.08898	null
2025-07-10	A Dynamic Stackelberg Game Framework for Agentic AI Defense Against LLM Jailbreaking	Zhengye Han et.al.	2507.08207	null
2025-07-10	Defending Against Prompt Injection With a Few DefensiveTokens	Sizhe Chen et.al.	2507.07974	null
2025-07-10	GuardVal: Dynamic Large Language Model Jailbreak Evaluation for Comprehensive Safety Testing	Peiyan Zhang et.al.	2507.07735	null
2025-07-10	May I have your Attention? Breaking Fine-Tuning based Prompt Injection Defenses using Architecture-Aware Attacks	Nishit V. Pandya et.al.	2507.07417	null
2025-07-09	An attention-aware GNN-based input defender against multi-turn jailbreak on LLMs	Zixuan Huang et.al.	2507.07146	null
2025-07-11	The Dark Side of LLMs Agent-based Attacks for Complete Computer Takeover	Matteo Lupinacci et.al.	2507.06850	null
2025-07-09	On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks	Stephen Obadinma et.al.	2507.06489	null
2025-07-09	Foundation Model Self-Play: Open-Ended Strategy Innovation via Foundation Models	Aaron Dharna et.al.	2507.06466	null
2025-07-08	Bridging AI and Software Security: A Comparative Vulnerability Assessment of LLM Agent Deployment Paradigms	Tarek Gasmi et.al.	2507.06323	null
2025-07-08	The bitter lesson of misuse detection	Hadrien Mariaccia et.al.	2507.06282	null
2025-07-08	Hidden Prompts in Manuscripts Exploit AI-Assisted Peer Review	Zhicheng Lin et.al.	2507.06185	null
2025-07-08	CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations	Xiaohu Li et.al.	2507.06043	null
2025-07-08	TuneShield: Mitigating Toxicity in Conversational AI while Fine-tuning on Untrusted Data	Aravind Cheruvu et.al.	2507.05660	null
2025-07-08	How Not to Detect Prompt Injections with an LLM	Sarthak Choudhary et.al.	2507.05630	null
2025-07-07	A Systematization of Security Vulnerabilities in Computer Use Agents	Daniel Jones et.al.	2507.05445	null
2025-07-07	Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models	Ziqi Miao et.al.	2507.05248	null
2025-07-07	Trojan Horse Prompting: Jailbreaking Conversational Multimodal Models by Forging Assistant Message	Wei Duan et.al.	2507.04673	null
2025-07-06	Tail-aware Adversarial Attacks: A Distributional Approach to Efficient LLM Jailbreaking	Tim Beyer et.al.	2507.04446	null
2025-07-06	Attention Slipping: A Mechanistic Understanding of Jailbreak Attacks and Defenses in LLMs	Xiaomeng Hu et.al.	2507.04365	null
2025-07-04	On Jailbreaking Quantized Language Models Through Fault Injection Attacks	Noureldin Zahran et.al.	2507.03236	null
2025-07-03	Adversarial Manipulation of Reasoning Models using Internal Representations	Kureha Yamaguchi et.al.	2507.03167	null
2025-07-03	LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users	Almog Hilel et.al.	2507.02850	null
2025-07-03	Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection	Ziqi Miao et.al.	2507.02844	null
2025-07-03	Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models	Riccardo Cantini et.al.	2507.02799	null
2025-07-03	Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks	Sizhe Chen et.al.	2507.02735	null
2025-07-03	PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage	Krishna Kanth Nakka et.al.	2507.02332	null
2025-07-02	MGC: A Compiler Framework Exploiting Compositional Blindness in Aligned LLMs for Malware Generation	Lu Yan et.al.	2507.02057	null
2025-07-02	SafePTR: Token-Level Jailbreak Defense in Multimodal LLMs via Prune-then-Restore Mechanism	Beitao Chen et.al.	2507.01513	null
2025-07-01	Reasoning as an Adaptive Defense for Safety	Taeyoun Kim et.al.	2507.00971	null
2025-07-01	SafeMobile: Chain-level Jailbreak Detection and Automated Evaluation for Multimodal Mobile Agents	Siyuan Liang et.al.	2507.00841	null
2025-07-02	Transferable Modeling Strategies for Low-Resource LLM Tasks: A Prompt and Alignment-Based Approach	Shuangquan Lyu et.al.	2507.00601	null
2025-06-30	Linearly Decoding Refused Knowledge in Aligned Language Models	Aryan Shrivastava et.al.	2507.00239	null
2025-06-30	Logit-Gap Steering: Efficient Short-Suffix Jailbreaks for Aligned Large Language Models	Tung-Ling Li et.al.	2506.24056	null
2025-06-30	Leveraging the Potential of Prompt Engineering for Hate Speech Detection in Low-Resource Languages	Ruhina Tabasshum Prome et.al.	2506.23930	null
2025-06-30	Evaluating Multi-Agent Defences Against Jailbreaking Attacks on Large Language Models	Maria Carolina Cornelia Wit et.al.	2506.23576	null
2025-06-29	From Prompt Injections to Protocol Exploits: Threats in LLM-Powered AI Agents Workflows	Mohamed Amine Ferrag et.al.	2506.23260	null
2025-06-28	Agent-to-Agent Theory of Mind: Testing Interlocutor Awareness among Large Language Models	Younwoo Choi et.al.	2506.22957	null
2025-06-27	VERA: Variational Inference Framework for Jailbreaking Large Language Models	Anamika Lochab et.al.	2506.22666	null
2025-06-27	MetaCipher: A General and Extensible Reinforcement Learning Framework for Obfuscation-Based Jailbreak Attacks on Black-Box LLMs	Boyuan Chen et.al.	2506.22557	null
2025-07-01	Red Teaming for Generative AI, Report on a Copyright-Focused Exercise Completed in an Academic Medical Center	James Wen et.al.	2506.22523	null
2025-06-27	A Different Approach to AI Safety: Proceedings from the Columbia Convening on Openness in Artificial Intelligence and AI Safety	Camille François et.al.	2506.22183	null
2025-06-27	Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses	Mohamed Ahmed et.al.	2506.21972	null
2025-06-24	PrivacyXray: Detecting Privacy Breaches in LLMs through Semantic Consistency and Probability Certainty	Jinwen He et.al.	2506.19563	null
2025-06-24	MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models	Yinan Xia et.al.	2506.19257	null
2025-06-23	Command-V: Pasting LLM Behaviors via Activation Profiles	Barry Wang et.al.	2506.19140	null
2025-06-23	Enhancing Security in LLM Applications: A Performance Evaluation of Early Detection Systems	Valerii Gakh et.al.	2506.19109	null
2025-06-23	Security Assessment of DeepSeek and GPT Series Models against Jailbreak Attacks	Xiaodong Wu et.al.	2506.18543	null
2025-06-23	NSFW-Classifier Guided Prompt Sanitization for Safe Text-to-Image Generation	Yu Xie et.al.	2506.18325	null
2025-06-22	Multi-turn Jailbreaking via Global Refinement and Active Fabrication	Hua Tang et.al.	2506.17881	null
2025-06-20	Semantic-Aware Parsing for Security Logs	Julien Piet et.al.	2506.17512	null
2025-06-20	From Concepts to Components: Concept-Agnostic Attention Module Discovery in Transformers	Jingtong Su et.al.	2506.17052	null
2025-06-20	MIST: Jailbreaking Black-box Large Language Models via Iterative Semantic Tuning	Muyang Zheng et.al.	2506.16792	null
2025-06-20	Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models	Lei Jiang et.al.	2506.16760	null
2025-06-19	Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models	Biao Yi et.al.	2506.16447	null
2025-06-19	Probing the Robustness of Large Language Models Safety to Latent Perturbations	Tianle Gu et.al.	2506.16078	link
2025-06-18	Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts	Kartik Sharma et.al.	2506.15751	null
2025-06-18	Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers	Tommaso Green et.al.	2506.15674	link
2025-06-18	From LLMs to MLLMs to Agents: A Survey of Emerging Paradigms in Jailbreak Attacks and Defenses within LLM Ecosystem	Yanxu Mao et.al.	2506.15170	null
2025-06-17	OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents	Thomas Kuntz et.al.	2506.14866	link
2025-06-17	AIRTBench: Measuring Autonomous AI Red Teaming Capabilities in Language Models	Ads Dawson et.al.	2506.14682	link
2025-06-16	Alignment Quality Index (AQI): Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry, Cluster Divergence, and Layer wise Pooled Representations	Abhilekh Borah et.al.	2506.13901	null
2025-06-17	Safe-Child-LLM: A Developmental Benchmark for Evaluating LLM Safety in Child-LLM Interactions	Junfeng Jiao et.al.	2506.13510	link
2025-06-15	Jailbreak Strength and Model Similarity Predict Transferability	Rico Angell et.al.	2506.12913	null
2025-06-15	Universal Jailbreak Suffixes Are Strong Attention Hijackers	Matan Ben-Tov et.al.	2506.12880	link
2025-06-15	SecurityLingua: Efficient Defense of LLM Jailbreak Attacks via Security-Aware Prompt Compression	Yucheng Li et.al.	2506.12707	null
2025-06-15	Alphabet Index Mapping: Jailbreaking LLMs through Semantic Dissimilarity	Bilal Saleh Husain et.al.	2506.12685	null
2025-06-14	Pushing the Limits of Safety: A Technical Report on the ATLAS Challenge 2025	Zonghao Ying et.al.	2506.12430	link
2025-06-14	Exploring the Secondary Risks of Large Language Models	Jiawei Chen et.al.	2506.12382	null
2025-06-14	QGuard: Question-based Zero-shot Guard for Multi-modal LLM Safety	Taegyeong Lee et.al.	2506.12299	null
2025-06-13	InfoFlood: Jailbreaking Large Language Models with Information Overload	Advait Yadav et.al.	2506.12274	null
2025-06-13	Investigating Vulnerabilities and Defenses Against Audio-Visual Attacks: A Comprehensive Survey Emphasizing Multimodal Models	Jinming Wen et.al.	2506.11521	null
2025-06-12	How Well Can Reasoning Models Identify and Recover from Unhelpful Thoughts?	Sohee Yang et.al.	2506.10979	null
2025-06-12	SoK: Evaluating Jailbreak Guardrails for Large Language Models	Xunguang Wang et.al.	2506.10597	link
2025-06-10	Evaluation empirique de la sécurisation et de l'alignement de ChatGPT et Gemini: analyse comparative des vulnérabilités par expérimentations de jailbreaks	Rafaël Nouailles et.al.	2506.10029	null
2025-06-09	LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges	Haoyang Li et.al.	2506.10022	link
2025-06-11	LLMail-Inject: A Dataset from a Realistic Adaptive Prompt Injection Challenge	Sahar Abdelnabi et.al.	2506.09956	link
2025-06-11	Effective Red-Teaming of Policy-Adherent Agents	Itay Nakash et.al.	2506.09600	null
2025-06-11	AdversariaL attacK sAfety aLIgnment(ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement- Introducing Adversarial Vulnerability Quality Index (AVQI)	Danush Khanna et.al.	2506.08885	null
2025-06-11	Design Patterns for Securing LLM Agents against Prompt Injections	Luca Beurer-Kellner et.al.	2506.08837	null
2025-06-09	TokenBreak: Bypassing Text Classification Models Through Token Manipulation	Kasimir Schulz et.al.	2506.07948	null
2025-06-11	RSafe: Incentivizing proactive reasoning to build robust and adaptive LLM safeguards	Jingnan Zheng et.al.	2506.07736	null
2025-06-09	Evaluating LLMs Robustness in Less Resourced Languages with Proxy Models	Maciej Chrabąszcz et.al.	2506.07645	null
2025-06-09	TwinBreak: Jailbreaking LLM Security Alignments based on Twin Prompts	Torsten Krauß et.al.	2506.07596	null
2025-06-09	When Style Breaks Safety: Defending Language Models Against Superficial Style Alignment	Yuxin Xiao et.al.	2506.07452	link
2025-06-09	Beyond Jailbreaks: Revealing Stealthier and Broader LLM Security Risks Stemming from Alignment Failures	Yukai Zhou et.al.	2506.07402	null
2025-06-08	AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint	Leheng Sheng et.al.	2506.07022	link
2025-06-10	Auditing Black-Box LLM APIs with a Rank-Based Uniformity Test	Xiaoyuan Zhu et.al.	2506.06975	null
2025-06-06	Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance	Ruizhong Qiu et.al.	2506.06444	link
2025-06-06	Small Models, Big Support: A Local LLM Framework for Teacher-Centric Content Creation and Assessment using RAG and CAG	Zarreen Reza et.al.	2506.05925	null
2025-06-06	To Protect the LLM Agent Against the Prompt Injection Attack with Polymorphic Prompt	Zhilong Wang et.al.	2506.05739	null
2025-06-05	Sentinel: SOTA model to protect against prompt injections	Dror Ivry et.al.	2506.05446	null
2025-06-05	Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets	Lei Hsiung et.al.	2506.05346	null
2025-06-05	HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model	Youngwan Lee et.al.	2506.04704	null
2025-06-06	TracLLM: A Generic Framework for Attributing Long Context LLMs	Yanting Wang et.al.	2506.04202	link
2025-06-03	Adversarial Attacks on Robotic Vision Language Action Models	Eliot Krzysztof Jones et.al.	2506.03350	link
2025-06-03	It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics	Matthew Kowal et.al.	2506.02873	null
2025-06-03	ATAG: AI-Agent Application Threat Assessment with Attack Graphs	Parth Atulbhai Gandhi et.al.	2506.02859	null
2025-06-03	From Prompts to Protection: Large Language Model-Enabled In-Context Learning for Smart Public Safety UAV	Yousef Emami et.al.	2506.02649	null
2025-06-03	BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage	Kalyan Nakka et.al.	2506.02479	link
2025-06-03	VPI-Bench: Visual Prompt Injection Attacks for Computer-Use Agents	Tri Cao et.al.	2506.02456	link
2025-06-02	ReGA: Representation-Guided Abstraction for Model-based Safeguarding of LLMs	Zeming Wei et.al.	2506.01770	link
2025-06-02	Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models	Youze Wang et.al.	2506.01307	null
2025-06-01	Simple Prompt Injection Attacks Can Leak Personal Data Observed by LLM Agents During Task Execution	Meysam Alizadeh et.al.	2506.01055	null
2025-06-01	Predicting Empirical AI Research Outcomes with Language Models	Jiaxin Wen et.al.	2506.00794	null
2025-06-01	Jailbreak-R1: Exploring the Jailbreak Capabilities of LLMs via Reinforcement Learning	Weiyang Guo et.al.	2506.00782	null
2025-05-30	TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis	Xiaorui Wu et.al.	2505.24672	link
2025-05-30	Benchmarking Large Language Models for Cryptanalysis and Mismatched-Generalization	Utsav Maskey et.al.	2505.24621	null
2025-05-30	AMIA: Automatic Masking and Joint Intention Analysis Makes LVLMs Robust Jailbreak Defenders	Yuqi Zhang et.al.	2505.24519	null
2025-05-30	Model Unlearning via Sparse Autoencoder Subspace Guided Projections	Xu Wang et.al.	2505.24428	null
2025-05-30	From Hallucinations to Jailbreaks: Rethinking the Vulnerability of Large Foundation Models	Haibo Jin et.al.	2505.24232	null
2025-05-30	SentinelAgent: Graph-based Anomaly Detection in Multi-Agent Systems	Xu He et.al.	2505.24201	null
2025-05-29	LLM Agents Should Employ Security Principles	Kaiyuan Zhang et.al.	2505.24019	null
2025-05-29	Securing AI Agents with Information-Flow Control	Manuel Costa et.al.	2505.23643	link
2025-05-29	Understanding Refusal in Language Models with Sparse Autoencoders	Wei Jie Yeo et.al.	2505.23556	link
2025-05-29	Adaptive Jailbreaking Strategies Based on the Semantic Understanding Capabilities of Large Language Models	Mingyu Yu et.al.	2505.23404	null
2025-05-28	Operationalizing CaMeL: Strengthening LLM Defenses for Enterprise Deployment	Krti Tallam et.al.	2505.22852	null
2025-05-28	Adaptive Detoxification: Safeguarding General Capabilities of LLMs through Toxicity-Aware Knowledge Editing	Yifan Lu et.al.	2505.22298	null
2025-05-28	Test-Time Immunization: A Universal Defense Framework Against Jailbreaks for (Multimodal) Large Language Models	Yongcan Yu et.al.	2505.22271	null
2025-05-28	Jailbreak Distillation: Renewable Safety Benchmarking	Jingyu Zhang et.al.	2505.22037	null
2025-05-28	RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments	Zeyi Liao et.al.	2505.21936	link
2025-05-27	Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation	Tharindu Kumarage et.al.	2505.21784	null
2025-05-26	Benign-to-Toxic Jailbreaking: Inducing Harmful Responses from Harmless Prompts	Hee-Seon Kim et.al.	2505.21556	null
2025-05-28	Breaking the Ceiling: Exploring the Potential of Jailbreak Attacks through Expanding Strategy Space	Yao Huang et.al.	2505.21277	link
2025-05-27	Improved Representation Steering for Language Models	Zhengxuan Wu et.al.	2505.20809	link
2025-05-26	Holes in Latent Space: Topological Signatures Under Adversarial Influence	Aideen Fay et.al.	2505.20435	null
2025-05-26	Lifelong Safety Alignment for Language Models	Haoyu Wang et.al.	2505.20259	link
2025-05-26	Capability-Based Scaling Laws for LLM Red-Teaming	Alexander Panfilov et.al.	2505.20162	link
2025-05-26	Attention! Your Vision Language Model Could Be Maliciously Manipulated	Xiaosen Wang et.al.	2505.19911	null
2025-05-26	What Really Matters in Many-Shot Attacks? An Empirical Study of Long-Context Vulnerabilities in LLMs	Sangyeop Kim et.al.	2505.19773	null
2025-05-26	SGM: A Framework for Building Specification-Guided Moderation Filters	Masoomali Fatehkia et.al.	2505.19766	null
2025-05-26	VisCRA: A Visual Chain Reasoning Attack for Jailbreaking Multimodal Large Language Models	Bingrui Sima et.al.	2505.19684	null
2025-05-26	JailBound: Jailbreaking Internal Safety Boundaries of Vision-Language Models	Jiaxin Song et.al.	2505.19610	null
2025-05-25	GhostPrompt: Jailbreaking Text-to-image Generative Models based on Dynamic Optimization	Zixuan Chen et.al.	2505.18979	null
2025-05-25	Stronger Enforcement of Instruction Hierarchy via Augmented Intermediate Representations	Sanjay Kariyappa et.al.	2505.18907	null
2025-05-24	Security Concerns for Large Language Models: A Survey	Miles Q. Li et.al.	2505.18889	null
2025-05-24	Audio Jailbreak Attacks: Exposing Vulnerabilities in SpeechGPT in a White-Box Framework	Binhao Ma et.al.	2505.18864	link
2025-05-23	Survival Games: Human-LLM Strategic Showdowns under Severe Resource Scarcity	Zhihong Chen et.al.	2505.17937	link
2025-05-23	Does Chain-of-Thought Reasoning Really Reduce Harmfulness from Jailbreaking?	Chengda Lu et.al.	2505.17650	null
2025-05-23	Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models	Jiawei Kong et.al.	2505.17601	null
2025-05-23	One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs	Linbao Li et.al.	2505.17598	link
2025-05-23	JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models	Zifan Peng et.al.	2505.17568	link
2025-05-23	Chain-of-Lure: A Synthetic Narrative-Driven Approach to Compromise Large Language Models	Wenhan Chang et.al.	2505.17519	null
2025-05-22	Refusal Direction is Universal Across Safety-Aligned Languages	Xinpeng Wang et.al.	2505.17306	null
2025-05-22	In-Context Watermarks for Large Language Models	Yepeng Liu et.al.	2505.16934	null
2025-05-22	When Safety Detectors Aren't Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques	Jianing Geng et.al.	2505.16765	null
2025-05-23	Finetuning-Activated Backdoors in LLMs	Thibaud Gloaguen et.al.	2505.16567	link
2025-05-22	Implicit Jailbreak Attacks via Cross-Modal Information Concealment on Vision-Language Models	Zhaoxin Wang et.al.	2505.16446	null
2025-05-22	Three Minds, One Legend: Jailbreak Large Reasoning Model with Adaptive Stacked Ciphers	Viet-Anh Nguyen et.al.	2505.16241	null
2025-05-22	SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning	Kaiwen Zhou et.al.	2505.16186	null
2025-05-21	Scalable Defense against In-the-wild Jailbreaking Attacks with Safety Context Retrieval	Taiye Chen et.al.	2505.15753	null
2025-05-21	Alignment Under Pressure: The Case for Informed Adversaries When Evaluating LLM Defenses	Xiaoxue Yang et.al.	2505.15738	link
2025-05-21	Silent Leaks: Implicit Knowledge Extraction Attack on RAG Systems through Benign Queries	Yuhao Wang et.al.	2505.15420	null
2025-05-21	Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models	Zirui Song et.al.	2505.15406	link
2025-05-20	SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment	Wonje Jeung et.al.	2505.14667	null
2025-05-20	sudoLLM: On Multi-role Alignment of Language Models	Soumadeep Saha et.al.	2505.14607	null
2025-05-20	Can Large Language Models Really Recognize Your Name?	Dzung Pham et.al.	2505.14549	link
2025-05-20	Breaking Bad Tokens: Detoxification of LLMs Using Sparse Autoencoders	Agam Goyal et.al.	2505.14536	null
2025-05-20	Lessons from Defending Gemini Against Indirect Prompt Injections	Chongyang Shi et.al.	2505.14534	null
2025-05-20	Is Your Prompt Safe? Investigating Prompt Injection Attacks Against Open-Source LLMs	Jiawen Wang et.al.	2505.14368	null
2025-05-20	Exploring Jailbreak Attacks on LLMs through Intent Concealment and Diversion	Tiehan Cui et.al.	2505.14316	null
2025-05-20	EVA: Red-Teaming GUI Agents via Evolving Indirect Prompt Injection	Yijie Lu et.al.	2505.14289	null
2025-05-20	"Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs	Darpan Aswal et.al.	2505.14226	null
2025-05-20	AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models	Guangke Chen et.al.	2505.14103	null
2025-05-19	Investigating the Vulnerability of LLM-as-a-Judge Architectures to Prompt-Injection Attacks	Narek Maloyan et.al.	2505.13348	null
2025-05-19	I'll believe it when I see it: Images increase misinformation sharing in Vision-Language Models	Alice Plebe et.al.	2505.13302	link
2025-05-19	The Hidden Dangers of Browsing AI Agents	Mykyta Mudryi et.al.	2505.13076	null
2025-05-18	BadNAVer: Exploring Jailbreak Attacks On Vision-and-Language Navigation	Wenqi Lyu et.al.	2505.12443	null
2025-05-18	CAPTURE: Context-Aware Prompt Injection Testing and Robustness Enhancement	Gauri Kholkar et.al.	2505.12368	null
2025-05-18	The Tower of Babel Revisited: Multilingual Jailbreak Prompts on Closed-Source Large Language Models	Linghan Huang et.al.	2505.12287	null
2025-05-17	Why Not Act on What You Know? Unleashing Safety Potential of LLMs via Self-Aware Guard Enhancement	Peng Ding et.al.	2505.12060	link
2025-05-17	Multilingual Collaborative Defense for Large Language Models	Hongliang Li et.al.	2505.11835	link
2025-05-17	JULI: Jailbreak Large Language Models by Self-Introspection	Jesson Wang et.al.	2505.11790	null
2025-05-16	EnvInjection: Environmental Prompt Injection Attack to Multi-modal Web Agents	Xilong Wang et.al.	2505.11717	null
2025-05-16	ProxyPrompt: Securing System Prompts against Prompt Extraction Attacks	Zhixiong Zhuang et.al.	2505.11459	null
2025-05-16	CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs	Sijia Chen et.al.	2505.11413	null
2025-05-16	AutoRAN: Weak-to-Strong Jailbreaking of Large Reasoning Models	Jiacheng Liang et.al.	2505.10846	link
2025-05-16	LARGO: Latent Adversarial Reflection through Gradient Optimization for Jailbreaking LLMs	Ran Li et.al.	2505.10838	null
2025-05-15	Dark LLMs: The Growing Threat of Unaligned AI Models	Michael Fire et.al.	2505.10066	null
2025-05-15	Analysing Safety Risks in LLMs Fine-Tuned with Pseudo-Malicious Cyber Security Data	Adel ElZemity et.al.	2505.09974	null
2025-05-16	PIG: Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization	Yidan Wang et.al.	2505.09921	link
2025-05-14	Adversarial Attack on Large Language Models using Exponentiated Gradient Descent	Sajib Biswas et.al.	2505.09820	link
2025-05-14	Adversarial Suffix Filtering: a Defense Pipeline for LLMs	David Khachaturov et.al.	2505.09602	null
2025-05-11	TokenProber: Jailbreaking Text-to-image Models via Fine-grained Word Impact Analysis	Longtian Wang et.al.	2505.08804	null
2025-05-13	A Large-Scale Empirical Analysis of Custom GPTs' Vulnerabilities in the OpenAI Ecosystem	Sunday Oyinlola Ogundoyin et.al.	2505.08148	link
2025-05-12	Concept-Level Explainability for Auditing & Steering LLM Responses	Kenza Amara et.al.	2505.07610	link
2025-05-12	One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models	Haoran Gu et.al.	2505.07167	null
2025-05-10	Jailbreaking the Text-to-Video Generative Models	Jiayang Liu et.al.	2505.06679	null
2025-05-10	Practical Reasoning Interruption Attacks on Reasoning Large Language Models	Yu Cui et.al.	2505.06643	null
2025-05-10	Think in Safety: Unveiling and Mitigating Safety Alignment Collapse in Multimodal Large Reasoning Model	Xinyue Lou et.al.	2505.06538	link
2025-05-10	System Prompt Poisoning: Persistent Attacks on Large Language Models Beyond User Injection	Jiawei Guo et.al.	2505.06493	null
2025-05-08	Defending against Indirect Prompt Injection by Instruction Detection	Tongyu Wen et.al.	2505.06311	link
2025-05-09	AgentXploit: End-to-End Redteaming of Black-Box AI Agents	Zhun Wang et.al.	2505.05849	null
2025-05-12	LiteLMGuard: Seamless and Lightweight On-Device Prompt Filtering for Safeguarding Small Language Models against Quantization-induced Risks and Vulnerabilities	Kalyan Nakka et.al.	2505.05619	link
2025-05-07	Red Teaming the Mind of the Machine: A Systematic Evaluation of Prompt Injection and Jailbreak Vulnerabilities in LLMs	Chetan Pathade et.al.	2505.04806	null
2025-05-07	Safeguard-by-Development: A Privacy-Enhanced Development Paradigm for Multi-Agent Collaboration Systems	Jian Cui et.al.	2505.04799	null
2025-05-07	A Proposal for Evaluating the Operational Risk for ChatBots based on Large Language Models	Pedro Pinacho-Davidson et.al.	2505.04784	null
2025-05-07	The Aloe Family Recipe for Open and Specialized Healthcare LLMs	Dario Garcia-Gasulla et.al.	2505.04388	null
2025-05-07	Unmasking the Canvas: A Dynamic Benchmark for Image Generation Jailbreaking and LLM Content Safety	Variath Madhupal Gautham Nair et.al.	2505.04146	null
2025-05-06	LlamaFirewall: An open source guardrail system for building secure AI agents	Sahana Chennabasappa et.al.	2505.03574	null
2025-05-03	Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs	Haoming Yang et.al.	2505.02862	null
2025-05-04	Open Challenges in Multi-Agent Security: Towards Secure Systems of Interacting AI Agents	Christian Schroeder de Witt et.al.	2505.02077	null
2025-05-05	Helping Large Language Models Protect Themselves: An Enhanced Filtering and Summarization System	Sheikh Samit Muhaimin et.al.	2505.01315	null
2025-05-01	OET: Optimization-based prompt injection Evaluation Toolkit	Jinsheng Pan et.al.	2505.00843	link
2025-05-05	The Illusion of Role Separation: Hidden Shortcuts in LLM Role Learning (and How to Fix Them)	Zihao Wang et.al.	2505.00626	null
2025-04-29	HyPerAlign: Hypotheses-driven Personalized Alignment	Cristina Garbacea et.al.	2505.00038	null
2025-04-30	XBreaking: Explainable Artificial Intelligence for Jailbreaking LLMs	Marco Arazzi et.al.	2504.21700	null
2025-04-30	Hoist with His Own Petard: Inducing Guardrails to Facilitate Denial-of-Service Attacks on Retrieval-Augmented Generation of LLMs	Pan Suo et.al.	2504.21680	null
2025-04-30	The Dual Power of Interpretable Token Embeddings: Jailbreaking Attacks and Defenses for Diffusion Model Unlearning	Siyi Chen et.al.	2504.21307	null
2025-04-29	CachePrune: Neural-Based Attribution Defense Against Indirect Prompt Injection Attacks	Rui Wang et.al.	2504.21228	null
2025-04-29	ACE: A Security Architecture for LLM-Integrated App Systems	Evan Li et.al.	2504.20984	null
2025-04-29	AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security	Zikui Cai et.al.	2504.20965	link
2025-04-29	Chain-of-Defensive-Thought: Structured Reasoning Elicits Robustness in Large Language Models against Reference Corruption	Wenxiao Wang et.al.	2504.20769	null
2025-04-29	Token-Efficient Prompt Injection Attack: Provoking Cessation in LLM Reasoning via Adaptive Token Compression	Yu Cui et.al.	2504.20493	null
2025-04-29	Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction	Yulin Chen et.al.	2504.20472	null
2025-04-29	Inception: Jailbreak the Memory Mechanism of Text-to-Image Generation Systems	Shiqian Zhao et.al.	2504.20376	null
2025-04-28	Prompt Injection Attack to Tool Selection in LLM Agents	Jiawen Shi et.al.	2504.19793	null
2025-04-29	Security Steerability is All You Need	Itay Hazan et.al.	2504.19521	null
2025-04-28	JailbreaksOverTime: Detecting Jailbreak Attacks Under Distribution Shift	Julien Piet et.al.	2504.19440	link
2025-04-27	Small Models, Big Tasks: An Exploratory Empirical Study on Small Language Models for Function Calling	Ishan Kavathekar et.al.	2504.19277	link
2025-04-26	Graph of Attacks: Improved Black-Box and Interpretable Jailbreaks for LLMs	Mohammad Akbar-Tajari et.al.	2504.19019	link
2025-04-22	WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks	Ivan Evtimov et.al.	2504.18575	link
2025-04-25	Adversarial Attacks on LLM-as-a-Judge Systems: Insights from Prompt Injections	Narek Maloyan et.al.	2504.18333	null
2025-04-23	Amplified Vulnerabilities: Structured Jailbreak Attacks on LLM-based Multi-Agent Debate	Senmao Qi et.al.	2504.16489	null
2025-04-20	Breaking the Prompt Wall (I): A Real-World Case Study of Attacking ChatGPT via Lightweight Prompt Injection	Xiangyu Chang et.al.	2504.16125	null
2025-04-26	T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video Models	Siyuan Liang et.al.	2504.15512	null
2025-04-21	MR. Guard: Multilingual Reasoning Guardrail using Curriculum Learning	Yahan Yang et.al.	2504.15241	null
2025-04-20	Prompt-Hacking: The New p-Hacking?	Thomas Kosch et.al.	2504.14571	null
2025-04-20	LLM-Enabled In-Context Learning for Data Collection Scheduling in UAV-assisted Sensor Networks	Yousef Emami et.al.	2504.14556	null
2025-04-25	Manipulating Multimodal Agents via Cross-Modal Prompt Injection	Le Wang et.al.	2504.14348	null
2025-04-18	DETAM: Defending LLMs Against Jailbreak Attacks via Targeted Attention Modification	Yu Li et.al.	2504.13562	null
2025-04-15	X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents	Salman Rahman et.al.	2504.13203	null
2025-04-15	Concept Enhancement Engineering: A Lightweight and Efficient Robust Defense Against Jailbreak Attacks in Embodied AI	Jirui Yang et.al.	2504.13201	null
2025-04-17	GraphAttack: Exploiting Representational Blindspots in LLM Safety Mechanisms	Sinan He et.al.	2504.13052	null
2025-04-17	ZeroSumEval: Scaling LLM Evaluation with Inter-Model Competition	Haidar Khan et.al.	2504.12562	link
2025-04-14	You've Changed: Detecting Modification of Black-Box Large Language Models	Alden Dima et.al.	2504.12335	null
2025-04-15	DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks	Yupei Liu et.al.	2504.11358	link
2025-04-16	Bypassing Prompt Injection and Jailbreak Detection in LLM Guardrails	William Hackett et.al.	2504.11168	null
2025-04-15	Token-Level Constraint Boundary Search for Jailbreaking Text-to-Image Models	Jiangtao Liu et.al.	2504.11106	null
2025-04-14	The Jailbreak Tax: How Useful are Your Jailbreak Outputs?	Kristina Nikolić et.al.	2504.10694	link
2025-04-14	Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding	Tao Zhang et.al.	2504.10465	link
2025-04-16	LLM Unlearning Reveals a Stronger-Than-Expected Coreset Effect in Current Benchmarks	Soumyadeep Pal et.al.	2504.10185	link
2025-04-14	RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability	Yichi Zhang et.al.	2504.10081	null
2025-04-14	StruPhantom: Evolutionary Injection Attacks on Black-Box Tabular Agents Powered by Large Language Models	Yang Feng et.al.	2504.09841	null
2025-04-13	The Structural Safety Generalization Problem	Julius Broomfield et.al.	2504.09712	link
2025-04-13	Mitigating Many-Shot Jailbreaking	Christopher M. Ackerman et.al.	2504.09604	null
2025-04-13	ControlNET: A Firewall for RAG-based LLM System	Hongwei Yao et.al.	2504.09593	null
2025-04-13	AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender	Weixiang Zhao et.al.	2504.09466	null
2025-04-13	SaRO: Enhancing LLM Safety through Reasoning-based Alignment	Yutao Mou et.al.	2504.09420	null
2025-04-12	Feature-Aware Malicious Output Detection and Mitigation	Weilong Dong et.al.	2504.09191	null
2025-04-10	Geneshift: Impact of different scenario shift on Jailbreaking LLM	Tianyi Wu et.al.	2504.08104	null
2025-04-10	Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge	Riccardo Cantini et.al.	2504.07887	link
2025-04-10	Defense against Prompt Injection Attacks via Mixture of Encodings	Ruiyi Zhang et.al.	2504.07467	link
2025-04-09	Bypassing Safety Guardrails in LLMs Using Humor	Pedro Cisneros-Velarde et.al.	2504.06577	null
2025-04-08	Mind the Trojan Horse: Image Prompt Adapter Enabling Scalable and Deceptive Jailbreaking	Junxi Chen et.al.	2504.05838	link
2025-04-08	Separator Injection Attack: Uncovering Dialogue Biases in Large Language Models Caused by Role Separators	Xitao Li et.al.	2504.05689	null
2025-04-08	Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking	Yu-Hang Wu et.al.	2504.05652	link
2025-04-07	A Domain-Based Taxonomy of Jailbreak Vulnerabilities in Large Language Models	Carlos Peláez-González et.al.	2504.04976	null
2025-04-08	Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models	Yubo Li et.al.	2504.04717	link
2025-04-06	StyleRec: A Benchmark Dataset for Prompt Recovery in Writing Style Transformation	Shenyang Liu et.al.	2504.04373	null
2025-04-08	JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model	Yi Nian et.al.	2504.03770	link
2025-04-03	More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment	Yifan Wang et.al.	2504.02193	null
2025-04-02	Evolving Security in LLMs: A Study of Jailbreak Attacks and Defenses	Zhengchun Shang et.al.	2504.02080	null
2025-04-02	Representation Bending for Large Language Model Safety	Ashkan Yousefpour et.al.	2504.01550	link
2025-04-02	LightDefense: A Lightweight Uncertainty-Driven Defense against Jailbreaks via Shifted Token Distribution	Zhuoran Yang et.al.	2504.01533	null
2025-04-07	PiCo: Jailbreaking Multimodal Large Language Models via $\textbf{Pi}$ctorial $\textbf{Co}$de Contextualization	Aofan Liu et.al.	2504.01444	null
2025-04-07	Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks	Jiawei Wang et.al.	2504.01308	link
2025-04-02	Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning	Si Chen et.al.	2504.01278	null
2025-04-01	Multilingual and Multi-Accent Jailbreaking of Audio LLMs	Jaechul Roh et.al.	2504.01094	null
2025-04-01	Exposing the Ghost in the Transformer: Abnormal Detection for Large Language Models via Hidden State Forensics	Shide Zhou et.al.	2504.00446	null
2025-03-31	Output Constraints as Attack Surface: Exploiting Structured Generation to Bypass LLM Safety Mechanisms	Shuoming Zhang et.al.	2503.24191	null
2025-03-29	Encrypted Prompt: Securing LLM Applications Against Unauthorized Actions	Shih-Han Chan et.al.	2503.23250	null
2025-03-27	Prompt, Divide, and Conquer: Bypassing Large Language Model Safety Filters via Segmented and Distributed Prompt Processing	Johan Wahréus et.al.	2503.21598	null
2025-03-27	Harnessing Chain-of-Thought Metadata for Task Routing and Adversarial Prompt Detection	Ryan Marinelli et.al.	2503.21464	link
2025-03-26	Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy	Joonhyun Jeong et.al.	2503.20823	link
2025-03-26	Iterative Prompting with Persuasion Skills in Jailbreaking Large Language Models	Shih-Wen Ke et.al.	2503.20320	null
2025-03-26	sudo rm -rf agentic_security	Sejin Lee et.al.	2503.20279	link
2025-03-24	MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks	Wenhao You et.al.	2503.19134	null
2025-03-23	SRMIR: Shadow Reward Models Based on Introspective Reasoning for LLM Alignment	Ruoxi Cheng et.al.	2503.18991	null
2025-03-24	Defeating Prompt Injections by Design	Edoardo Debenedetti et.al.	2503.18813	null
2025-03-23	Metaphor-based Jailbreaking Attacks on Text-to-Image Models	Chenyu Zhang et.al.	2503.17987	null
2025-03-23	Smoke and Mirrors: Jailbreaking LLM-based Code Generation via Implicit Malicious Prompts	Sheng Ouyang et.al.	2503.17953	null

Code Embedding

Publish Date	Title	Authors	PDF	Code
2025-07-21	AlgoSimBench: Identifying Algorithmically Similar Problems for Competitive Programming	Jierui Li et.al.	2507.15378	null
2025-07-16	When Retriever Meets Generator: A Joint Model for Code Comment Generation	Tien P. T. Le et.al.	2507.12558	null
2025-07-07	Unified Framework for Quantum Code Embedding	Andrew C. Yuan et.al.	2507.05361	null
2025-05-27	Semi-supervised Clustering Through Representation Learning of Large-scale EHR Data	Linshanshan Wang et.al.	2505.20731	null
2025-05-19	Towards A Generalist Code Embedding Model Based On Massive Data Synthesis	Chaofan Li et.al.	2505.12697	link
2025-05-31	Improving the Context Length and Efficiency of Code Retrieval for Tracing Security Vulnerability Fixes	Xueqing Liu et.al.	2503.22935	null
2025-07-17	OASIS: Order-Augmented Strategy for Improved Code Search	Zuchen Gao et.al.	2503.08161	null
2025-03-10	Assessing Uncertainty in Stock Returns: A Gaussian Mixture Distribution-Based Method	Yanlong Wang et.al.	2503.06929	null
2025-06-02	LoRACode: LoRA Adapters for Code Embeddings	Saumya Chaturvedi et.al.	2503.05315	null
2025-03-07	Extended Controllability Tests for Quantum Decoherence-Free Subspaces	Eric B. Kopp et.al.	2503.05155	null
2025-02-21	GNN-Coder: Boosting Semantic Code Retrieval with Combined GNNs and Transformer	Yufan Ye et.al.	2502.15202	null
2025-03-16	Poisoned Source Code Detection in Code Models	Ehab Ghannoum et.al.	2502.13459	null
2025-02-07	EnseSmells: Deep ensemble and programming language models for automated code smells detection	Anh Ho et.al.	2502.05012	link
2025-03-26	Intelligent Code Embedding Framework for High-Precision Ransomware Detection via Multimodal Execution Path Analysis	Levi Gareth et.al.	2501.15836	null
2024-12-18	Transducer Tuning: Efficient Model Adaptation for Software Tasks Using Code Property Graphs	Imam Nur Bani Yusuf et.al.	2412.13467	link

(back to top)

Model Context Protocol

Publish Date	Title	Authors	PDF	Code
2025-07-08	Bridging AI and Software Security: A Comparative Vulnerability Assessment of LLM Agent Deployment Paradigms	Tarek Gasmi et.al.	2507.06323	null
2025-07-05	We Urgently Need Privilege Management in MCP: A Measurement of API Usage in MCP Ecosystems	Zhihao Li et.al.	2507.06250	null
2025-06-27	Conversational LLMs Simplify Secure Clinical Data Access, Understanding, and Analysis	Rafi Al Attrach et.al.	2507.01053	null
2025-07-01	VTS-Guided AI Interaction Workflow for Business Insights	Sun Ding et.al.	2507.00347	null
2025-06-30	A Large-Scale Evolvable Dataset for Model Context Protocol Ecosystem and Security Analysis	Zhiwei Lin et.al.	2506.23474	null
2025-06-29	From Prompt Injections to Protocol Exploits: Threats in LLM-Powered AI Agents Workflows	Mohamed Amine Ferrag et.al.	2506.23260	null
2025-06-18	RAS-Eval: A Comprehensive Benchmark for Security Evaluation of LLM Agents in Real-World Environments	Yuchuan Fu et.al.	2506.15253	link
2025-06-08	Personalized Constitutionally-Aligned Agentic Superego: Secure AI Behavior Aligned to Diverse Human Values	Nell Watson et.al.	2506.13774	null
2025-06-20	Model Context Protocol (MCP) at First Glance: Studying the Security and Maintainability of MCP Servers	Mohammed Mehedi Hasan et.al.	2506.13538	link
2025-06-12	QuantMCP: Grounding Large Language Models in Verifiable Financial Reality	Yifan Zeng et.al.	2506.06622	null
2025-05-26	Survey of LLM Agent Communication with MCP: A Software Design Pattern Centric Review	Anjana Sarkar et.al.	2506.05364	null
2025-06-05	Beyond the Protocol: Unveiling Attack Vectors in the Model Context Protocol Ecosystem	Hao Song et.al.	2506.02040	link
2025-06-02	ETDI: Mitigating Tool Squatting and Rug Pull Attacks in Model Context Protocol (MCP) by using OAuth-Enhanced Tool Definitions and Policy-Based Access Control	Manish Bhatt et.al.	2506.01333	null
2025-05-30	Chances and Challenges of the Model Context Protocol in Digital Forensics and Incident Response	Jan-Niclas Hilgert et.al.	2506.00274	null
2025-05-27	ADA: Automated Moving Target Defense for AI Workloads via Ephemeral Infrastructure-Native Rotation in Kubernetes	Akram Sheriff et.al.	2505.23805	null
2025-05-29	MCP Safety Training: Learning to Refuse Falsely Benign MCP Exploits using Improved Preference Alignment	John Halloran et.al.	2505.23634	null
2025-05-28	AgentDNS: A Root Domain Naming System for LLM Agents	Enfang Cui et.al.	2505.22368	null
2025-05-23	Gaming Tool Preferences in Agentic LLMs	Kazem Faghih et.al.	2505.18135	link
2025-05-22	Invisible Prompts, Visible Threats: Malicious Font Injection in External Resources for Large Language Models	Junjie Xiong et.al.	2505.16957	null
2025-05-16	MPMA: Preference Manipulation Attack Against Model Context Protocol	Zihan Wang et.al.	2505.11154	null
2025-05-06	From Glue-Code to Protocols: A Critical Analysis of A2A and MCP Integration for Scalable Agent Systems	Qiaomu Li et.al.	2505.03864	null
2025-05-23	A survey of agent interoperability protocols: Model Context Protocol (MCP), Agent Communication Protocol (ACP), Agent-to-Agent Protocol (A2A), and Agent Network Protocol (ANP)	Abul Ehtesham et.al.	2505.02279	null
2025-04-28	Simplified and Secure MCP Gateways for Enterprise AI Integration	Ivo Brett et.al.	2504.19997	link
2025-04-28	Securing GenAI Multi-Agent Systems Against Tool Squatting: A Zero Trust Registry-Based Approach	Vineeth Sai Narajala et.al.	2504.19951	null
2025-04-28	From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review	Mohamed Amine Ferrag et.al.	2504.19678	null
2025-05-02	Building A Secure Agentic AI Application Leveraging A2A Protocol	Idan Habler et.al.	2504.16902	null
2025-05-19	MCP Guardian: A Security-First Layer for Safeguarding MCP-Based AI System	Sonu Kumar et.al.	2504.12757	null
2025-04-11	MCP Bridge: A Lightweight, LLM-Agnostic RESTful Proxy for Model Context Protocol Servers	Arash Ahmadi et.al.	2504.08999	null
2025-05-02	Enterprise-Grade Security for the Model Context Protocol (MCP): Frameworks and Mitigation Strategies	Vineeth Sai Narajala et.al.	2504.08623	null
2025-04-11	MCP Safety Audit: LLMs with the Model Context Protocol Allow Major Security Exploits	Brandon Radosevich et.al.	2504.03767	link
2025-04-06	Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions	Xinyi Hou et.al.	2503.23278	null

(back to top)

Supply Chain Attacks

Publish Date	Title	Authors	PDF	Code
2025-06-24	FuncVul: An Effective Function Level Vulnerability Detection Model using LLM and Code Chunk	Sajal Halder et.al.	2506.19453	null
2025-05-30	When GPT Spills the Tea: Comprehensive Assessment of Knowledge File Leakage in GPTs	Xinyue Shen et.al.	2506.00197	null
2025-07-15	Seven Security Challenges That Must be Solved in Cross-domain Multi-agent LLM Systems	Ronny Ko et.al.	2505.23847	null
2025-05-27	JavaSith: A Client-Side Framework for Analyzing Potentially Malicious Extensions in Browsers, VS Code, and NPM Packages	Avihay Cohen et.al.	2505.21263	null
2025-06-30	LibVulnWatch: A Deep Assessment Agent System and Leaderboard for Uncovering Hidden Vulnerabilities in Open-Source AI Libraries	Zekun Wu et.al.	2505.08842	null
2025-05-07	Safeguard-by-Development: A Privacy-Enhanced Development Paradigm for Multi-Agent Collaboration Systems	Jian Cui et.al.	2505.04799	null
2025-05-02	A Rusty Link in the AI Supply Chain: Detecting Evil Configurations in Model Repositories	Ziqi Ding et.al.	2505.01067	null
2025-04-29	Understanding Large Language Model Supply Chain: Structure, Domain, and Vulnerabilities	Yanzhe Hu et.al.	2504.20763	null
2025-04-24	Automatically Generating Rules of Malicious Software Packages via Large Language Model	XiangRui Zhang et.al.	2504.17198	null
2025-03-27	Malicious and Unintentional Disclosure Risks in Large Language Models for Code Generation	Rafiqul Rabin et.al.	2503.22760	null
2025-05-26	The CodeInverter Suite: Control-Flow and Data-Mapping Augmented Binary Decompilation with LLMs	Peipei Liu et.al.	2503.07215	null
2025-02-18	SoK: Understanding Vulnerabilities in the Large Language Model Supply Chain	Shenao Wang et.al.	2502.12497	null
2025-01-31	Importing Phantoms: Measuring LLM Package Hallucination Vulnerabilities	Arjun Krishna et.al.	2501.19012	null
2024-12-26	Integrating Artificial Open Generative Artificial Intelligence into Software Supply Chain Security	Vasileios Alevizos et.al.	2412.19088	null
2024-12-23	Emerging Security Challenges of Large Language Models	Herve Debar et.al.	2412.17614	null
2024-12-22	Enhancing Supply Chain Transparency in Emerging Economies Using Online Contents and LLMs	Bohan Jin et.al.	2412.16922	null
2024-12-18	RAG for Effective Supply Chain Security Questionnaire Automation	Zaynab Batool Reza et.al.	2412.13988	null
2025-03-30	Data Extraction Attacks in Retrieval-Augmented Generation via Backdoors	Yuefeng Peng et.al.	2411.01705	null
2024-11-03	Large Language Model Supply Chain: Open Problems From the Security Perspective	Qiang Hu et.al.	2411.01604	null

(back to top)

