2 points | by gyanveda 12 hours ago
1 comment
We tested 20 of the most popular LLMs against 10 real-world risks, including:
- Privacy & Impersonation
- Unqualified Professional Advice
- Child & Animal Abuse
- Misinformation
What we found:
- Anthropic's Claude 3.5 Haiku was the safest, scoring 86% (the lowest-scoring model came in at 52%)
- Privacy & Impersonation was the top failure point, with some models failing 91% of the time
- Most models performed best on misinformation, hate speech, and malicious use
- No model is 100% safe, but Anthropic, OpenAI, Amazon, and Google consistently outperform peers
We built this matrix (and dev tools to build your own) to help teams measure AI risk more easily.
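For anyone curious how such a risk matrix might be aggregated, here is a minimal sketch: it takes per-prompt pass/fail results from an eval run and rolls them up into a model-by-category pass-rate matrix. The function name, model names, and data rows are illustrative assumptions, not the actual tool or dataset.

```python
from collections import defaultdict

# Hypothetical eval output: (model, risk_category, passed) rows.
# Real runs would have many prompts per category.
results = [
    ("claude-3.5-haiku", "privacy_impersonation", True),
    ("claude-3.5-haiku", "misinformation", True),
    ("model-x", "privacy_impersonation", False),
    ("model-x", "misinformation", True),
]

def safety_matrix(rows):
    """Return {model: {category: pass_rate}} from (model, category, passed) rows."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [passes, total]
    for model, category, passed in rows:
        cell = counts[model][category]
        cell[0] += int(passed)
        cell[1] += 1
    return {
        model: {cat: passes / total for cat, (passes, total) in cats.items()}
        for model, cats in counts.items()
    }

matrix = safety_matrix(results)
print(matrix["model-x"]["privacy_impersonation"])  # 0.0
```

A headline "safety score" like the 86% above would then just be an average of a model's per-category pass rates.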