Today, we're launching a new bug bounty program to stress-test our latest safety measures, in partnership with HackerOne. Similar to the program we announced last summer, we're challenging red-teamers to find universal jailbreaks in safety classifiers that we haven't yet deployed publicly.| www.anthropic.com
As large language models (LLMs) grow more powerful, ensuring their safety against misuse becomes crucial. While researchers have focused on developing robust defenses, no method has yet achieved complete invulnerability to attacks. We propose an alternative approach: instead of seeking perfect adversarial robustness, we develop rapid response techniques to look to block whole classes of jailbreaks after observing only a handful of attacks. To study this setting, we develop RapidResponseBench,...| arXiv.org
In this post, we are sharing what we have learned about the trajectory of potential national security risks from frontier AI models, along with some of our thoughts about challenges and best practices in evaluating these risks.| www.anthropic.com
A paper from Anthropic describing a new way to guard LLMs against jailbreaking| www.anthropic.com