AI Guardrails Index:

We broke AI guardrails down into six categories.

We curated datasets and models that demonstrate the state of AI safety using LLMs and other open source models.

Introduction

Jailbreaking is the manipulation of an LLM into bypassing its safety measures so that it produces restricted or harmful content. This poses significant risks across domains, from finance to national security. Jailbreak Detection guardrails mitigate these risks by identifying and blocking such attempts, which supports responsible AI deployment and protects users and organizations.

Results

Leaderboard
| Developer | Model | Latency | Metric |
|---|---|---|---|
| Anthropic | Claude 3 Haiku | 1.8267 ms | 0.8101 |
| Guardrails AI | Detect Jailbreak | 0.0527 ms | 0.8118 |
| jackhhao | llm_warden | 0.0119 ms | 0.7070 |
| Meta | Llama Prompt Guard 86M | 0.0515 ms | 0.6663 |
| Microsoft | AI Content Safety Prompt Shields | 0.0971 ms | 0.7331 |
| zhx123 | ftrobertallm | 0.0267 ms | 0.7398 |
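One way to read the latency and metric columns of the leaderboard together is to keep only the entries that no other entry beats on both axes at once (a Pareto front). The sketch below is illustrative only: the `Entry` type and `pareto_front` helper are hypothetical names, the developer/model split of each row is a reading of the leaderboard, and it assumes that a higher metric is better and that latencies are comparable as reported.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    developer: str
    model: str
    latency_ms: float  # as reported on the leaderboard
    metric: float      # assumed: higher is better


# Values copied from the leaderboard above.
LEADERBOARD = [
    Entry("Anthropic", "Claude 3 Haiku", 1.8267, 0.8101),
    Entry("Guardrails AI", "Detect Jailbreak", 0.0527, 0.8118),
    Entry("jackhhao", "llm_warden", 0.0119, 0.7070),
    Entry("Meta", "Llama Prompt Guard 86M", 0.0515, 0.6663),
    Entry("Microsoft", "AI Content Safety Prompt Shields", 0.0971, 0.7331),
    Entry("zhx123", "ftrobertallm", 0.0267, 0.7398),
]


def pareto_front(entries: List[Entry]) -> List[Entry]:
    """Keep entries that are not dominated on (lower latency, higher metric)."""
    front = []
    for e in entries:
        dominated = any(
            o.latency_ms <= e.latency_ms
            and o.metric >= e.metric
            and (o.latency_ms < e.latency_ms or o.metric > e.metric)
            for o in entries
        )
        if not dominated:
            front.append(e)
    return front


if __name__ == "__main__":
    for entry in pareto_front(LEADERBOARD):
        print(f"{entry.developer}: {entry.latency_ms} ms, {entry.metric}")
```

Under these assumptions, the entries left on the front are the ones worth comparing when trading detection quality against speed. Every other entry is matched or beaten on both columns by at least one of them.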

Dataset Breakdown

| Attack type | Samples |
|---|---|
| Roleplay/Pretend/Hypothetical | 1124 |
| Resistance Suppression | 792 |
| Permission-Granting | 447 |
| Roleplay/Pretend/Hypothetical (DAN attack) | 242 |
| Prompt Continuation or Saturation | 72 |
| Character-Gradient Attack | 60 |
| Prompt Saturation | 59 |
| Character-Gradient Attack (Adversarial Noise) | 45 |
| Prompt Obfuscation | 45 |
| Prompt Obfuscation (Program Execution) | 22 |
| Character-Gradient Attack (Special System Tokens) | 20 |
| Prompt Continuation | 20 |
| Character-Gradient Attack (Glitch Tokens) | 2 |
See the full dataset here: Jailbreaking dataset

Conclusion

For jailbreak detection, Guardrails AI emerges as the top performer of all guardrails tested. Its high Max F1 score and well-balanced true positive and true negative rates give it the best balance between security and usability. For high-stakes scenarios, Anthropic's model excels at threat detection, but it produces significantly more false positives, which reduces usability or increases the need for manual validation. Guardrails AI also offers flexible deployment options: fast GPU latency for time-sensitive applications and cost-effective CPU deployment for less critical tasks. Other models, such as zhx123's, offer lower latency but compromise on security performance. The choice of guardrail ultimately depends on the specific use case and on how it weighs security, usability, and performance requirements.
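To make the terms in the conclusion concrete, the following minimal sketch shows one common way to compute a maximum F1 score together with the true positive and true negative rates at the chosen threshold. It is not the benchmark's actual evaluation code: the function names, the threshold sweep over observed scores, and the toy labels and scores are all assumptions made for illustration.

```python
from typing import List, Tuple


def max_f1(labels: List[int], scores: List[float]) -> Tuple[float, float]:
    """Sweep every observed score as a threshold; return (best_f1, best_threshold).

    A prompt is flagged as a jailbreak when its score is >= threshold.
    """
    best_f1, best_threshold = 0.0, 1.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
        fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_f1, best_threshold = f1, threshold
    return best_f1, best_threshold


def tpr_tnr(labels: List[int], scores: List[float], threshold: float) -> Tuple[float, float]:
    """Return (true positive rate, true negative rate) at a fixed threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return tpr, tnr


if __name__ == "__main__":
    # Toy data (hypothetical): 1 = jailbreak attempt, 0 = benign prompt.
    labels = [1, 1, 1, 0, 0, 0, 1, 0]
    scores = [0.9, 0.8, 0.4, 0.35, 0.2, 0.1, 0.7, 0.6]
    f1, threshold = max_f1(labels, scores)
    print(f1, threshold, tpr_tnr(labels, scores, threshold))
```

A detector that flags aggressively raises its true positive rate while its true negative rate falls, and those extra false positives are the usability cost described above. Choosing the threshold that maximizes F1 is one way to balance the two.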