Evaluating the accuracy of large language models (LLMs) on contract review tasks is critical to understanding how reliable they are in practice. However, long-form, free-text responses to prompts are difficult to evaluate objectively. | www.screens.ai