AI Guardrails Index:

We broke AI guardrails down into six categories.

We curated datasets and models that demonstrate the state of AI safety using LLMs and other open source models.

Introduction

Jailbreaking means manipulating an LLM into bypassing its safety measures so that it produces restricted or harmful content. This poses significant risks across domains, from finance to national security. Jailbreak Detection guardrails mitigate these risks by identifying and blocking such attempts, which supports responsible AI deployment and protects users and organizations.

Results

Leaderboard

Developer	Model	Latency	Metric
Anthropic	Claude 3 Haiku	1.8267 ms	0.8101
Guardrails AI	Detect Jailbreak	0.0527 ms	0.8118
jackhhao	llm_warden	0.0119 ms	0.7070
Meta	Llama Prompt Guard 86M	0.0515 ms	0.6663
Microsoft	AI Content Safety Prompt Shields	0.0971 ms	0.7331
zhx123	ftrobertallm	0.0267 ms	0.7398
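
One way to read the latency and metric columns together is to look for guardrails that no other entry beats on both at once. The sketch below is only an illustration: the `leaderboard` list and the `pareto_front` helper are hypothetical names introduced here, the values are copied from the table above, and it assumes lower latency and a higher metric value are both better. It ignores anything not in the table, such as false positive rates or deployment cost.

```python
# Rows copied from the leaderboard table: (model, latency in ms, metric).
leaderboard = [
    ("Claude 3 Haiku", 1.8267, 0.8101),
    ("Detect Jailbreak", 0.0527, 0.8118),
    ("llm_warden", 0.0119, 0.7070),
    ("Llama Prompt Guard 86M", 0.0515, 0.6663),
    ("AI Content Safety Prompt Shields", 0.0971, 0.7331),
    ("ftrobertallm", 0.0267, 0.7398),
]


def pareto_front(rows):
    """Return the models that no other row beats on both latency and metric.

    Assumption: lower latency is better and a higher metric is better.
    """
    front = []
    for name, latency, score in rows:
        dominated = any(
            other_latency <= latency
            and other_score >= score
            and (other_latency < latency or other_score > score)
            for _, other_latency, other_score in rows
        )
        if not dominated:
            front.append(name)
    return front


print(pareto_front(leaderboard))
```

Under these assumptions, the entries that remain are the ones where picking a faster model always costs some metric value, which is the trade-off a reader would then weigh for a given use case.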

Dataset Breakdown

Attack Type	Samples
Roleplay/Pretend/Hypothetical	1124
Resistance Suppression	792
Permission-Granting	447
Roleplay/Pretend/Hypothetical (DAN attack)	242
Prompt Continuation or Saturation	72
Character-Gradient Attack	60
Prompt Saturation	59
Character-Gradient Attack (Adversarial Noise)	45
Prompt Obfuscation	45
Prompt Obfuscation (Program Execution)	22
Character-Gradient Attack (Special System Tokens)	20
Prompt Continuation	20
Character-Gradient Attack (Glitch Tokens)	2
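
To get a feel for how skewed the breakdown is, the counts can be turned into shares. This is a small sketch using the numbers from the table above; the `samples` dictionary and the variable names are introduced here for illustration, and it assumes each sample is counted in exactly one row, which the source does not state.

```python
# Sample counts copied from the dataset breakdown table.
samples = {
    "Roleplay/Pretend/Hypothetical": 1124,
    "Resistance Suppression": 792,
    "Permission-Granting": 447,
    "Roleplay/Pretend/Hypothetical (DAN attack)": 242,
    "Prompt Continuation or Saturation": 72,
    "Character-Gradient Attack": 60,
    "Prompt Saturation": 59,
    "Character-Gradient Attack (Adversarial Noise)": 45,
    "Prompt Obfuscation": 45,
    "Prompt Obfuscation (Program Execution)": 22,
    "Character-Gradient Attack (Special System Tokens)": 20,
    "Prompt Continuation": 20,
    "Character-Gradient Attack (Glitch Tokens)": 2,
}

# Assumption: rows do not overlap, so the sum is the total number of samples.
total = sum(samples.values())
shares = {name: count / total for name, count in samples.items()}

# Share of the three largest rows combined.
top_three_share = sum(sorted(shares.values(), reverse=True)[:3])

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"Total samples: {total}")
print(f"Top three rows combined: {top_three_share:.1%}")
```

If the no-overlap assumption holds, the three largest rows account for most of the samples, so aggregate scores on this dataset are driven mainly by those attack types.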

See the full dataset here: Jailbreaking dataset

Conclusion

For jailbreak detection, Guardrails AI is the top performer among the guardrails tested: its Max F1 score is high and its true positive and true negative rates are well balanced, which gives a good trade-off between security and usability. For high-stakes scenarios, Anthropic's model excels at threat detection, but it produces significantly more false positives, which reduces usability or increases the need for manual validation. Guardrails AI also offers flexible deployment options, with fast GPU latency for time-sensitive applications and cost-effective CPU deployment for less critical tasks, which further strengthens its position as the leading solution. Other models, such as zhx123's ftrobertallm, offer lower latency but compromise on security performance. The choice of guardrail ultimately depends on the specific use case and on how it balances security, usability, and performance requirements.
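
The source does not define how Max F1 was computed. A common reading is the best F1 score obtainable over all decision thresholds on a detector's output score, and the sketch below illustrates that reading only. The `max_f1` function and the toy `scores` and `labels` are hypothetical and are not taken from the benchmark; a label of 1 is assumed to mean "jailbreak".

```python
from typing import Sequence, Tuple


def max_f1(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (best F1, threshold) by trying every distinct score as a threshold.

    A prompt is flagged as a jailbreak when its score is >= the threshold.
    """
    best_f1, best_threshold = 0.0, 1.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
        if tp == 0:
            continue  # precision and recall are both zero, so F1 is zero
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_f1, best_threshold = f1, threshold
    return best_f1, best_threshold


# Toy data for illustration only: detector scores and ground-truth labels.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0]

best_f1, best_threshold = max_f1(scores, labels)
print(f"Max F1 = {best_f1:.4f} at threshold {best_threshold}")
```

On the toy data, the best threshold accepts one false positive in order to catch every jailbreak, which is the kind of security versus usability trade-off the conclusion describes.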