Topic: Testing Binary vs Score Evals on the Latest Models