When language models (LMs) are trained via reinforcement learning (RL) to generate natural language "reasoning chains", their performance improves on a variety of difficult question answering tasks. Today, almost all successful applications of RL for reasoning use binary reward functions that evaluate the correctness of LM outputs. Because such reward functions do not penalize guessing or low-confidence outputs, they often have the unintended side-effect of degrading calibration and increasing hallucination. | arXiv.org
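The incentive problem the abstract describes can be sketched in a few lines. This is an illustrative toy model, not code from the paper: it only shows that under a binary correctness reward, a guess with any nonzero chance of being right has higher expected reward than abstaining, so RL never rewards the model for expressing low confidence.

```python
def binary_reward(answer_correct: bool) -> float:
    # 1 for a correct answer, 0 otherwise; a wrong guess costs nothing.
    return 1.0 if answer_correct else 0.0

def expected_reward_guess(p_correct: float) -> float:
    # Expected reward when the model guesses and is right with
    # probability p_correct.
    return p_correct * binary_reward(True) + (1 - p_correct) * binary_reward(False)

def expected_reward_abstain() -> float:
    # Declining to answer is scored the same as answering wrongly.
    return binary_reward(False)

# Even a 1%-confident guess beats abstaining, so the reward pushes the
# model to always answer, regardless of its actual confidence.
assert expected_reward_guess(0.01) > expected_reward_abstain()
```

Because the gap in expected reward is strictly positive for any confidence above zero, the trained policy has no reason to hedge or abstain, which is one mechanism by which calibration can degrade.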
A new study shows that chain-of-thought (CoT) prompting improves large language models (LLMs) only on very narrow planning tasks and does not generalize broadly. | TechTalks - Technology solving problems... and creating new ones
Deep reinforcement learning is one of the most interesting branches of AI, responsible for achievements such as mastering complex games and for advances in self-driving cars and robotics. | TechTalks