This is a method of evaluating strategies for the multi-armed bandit problem [1]. The testbed works as follows: generate $10$ reward means $\mu_i$, one for each of $10$ actions $a_i$. On each time step, the agent takes some action $a_j$ and receives a reward $r_t \sim \mathcal N(\mu_j, 1)$. We repeat this for $100$ randomly sampled sets of $\mu_i$ and average the results. The agent's goal is to maximize average reward; ideally, it learns which action has the highest mean and samples that action.
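A minimal sketch of such a testbed, assuming an $\varepsilon$-greedy agent with sample-average value estimates (the strategy being evaluated is not specified above, so $\varepsilon$-greedy is an illustrative choice; all parameter values other than the $100$ runs and $10$ arms are assumptions):

```python
import numpy as np

def run_testbed(n_runs=100, n_arms=10, n_steps=1000, eps=0.1, seed=0):
    """Average reward per step, over independently sampled bandit problems."""
    rng = np.random.default_rng(seed)
    avg_rewards = np.zeros(n_steps)
    for _ in range(n_runs):
        mu = rng.normal(0.0, 1.0, n_arms)   # true action means mu_i
        q = np.zeros(n_arms)                # estimated action values
        n = np.zeros(n_arms)                # times each action was taken
        for t in range(n_steps):
            # eps-greedy: explore with probability eps, else act greedily
            if rng.random() < eps:
                a = int(rng.integers(n_arms))
            else:
                a = int(np.argmax(q))
            r = rng.normal(mu[a], 1.0)      # reward r_t ~ N(mu_a, 1)
            n[a] += 1
            q[a] += (r - q[a]) / n[a]       # incremental sample-average update
            avg_rewards[t] += r
    return avg_rewards / n_runs

rewards = run_testbed()
```

If the agent is learning, the average reward in later steps should exceed that of the first few steps, since it increasingly exploits the arm with the highest estimated mean.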