This post breaks down how LLMs are tested, which benchmarks matter, and what the scores mean, so you can figure out which model fits your needs.| RisingStack Engineering
AI researchers already use a range of evaluation benchmarks to identify unwanted behaviours in AI systems, such as making misleading statements, biased decisions, or repeating...| Google DeepMind
Meta AI announces Purple Llama, a project for open trust and safety in generative AI, with tools for cybersecurity and input/output filtering.| ai.meta.com
We introduce Alpaca 7B, a model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations.| crfm.stanford.edu