Here is a fun explanation of why neural networks do better when trained with noise. There are multiple existing explanations, but I particularly like this one. First, an aside: given 100 flips of an unfair coin with a 60% probability of heads, the single most likely sequence is 100 consecutive heads, but a typical sequence has about 60 heads (see the quick numerical check below). “Likely” and “typical” are two different things, and often what you really want is the “typical”. Let’s apply this idea to neural network...
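
A quick numerical check of the coin aside (a minimal sketch in Python; the simulation itself is my illustration, not part of the original argument). Any *single* sequence of 60 heads is far less probable than all-heads, but there are so many such sequences that the observed head count concentrates near 60:

```python
import math
import random

p = 0.6   # probability of heads
n = 100   # number of flips
k = 60    # typical head count, n * p

# The single most likely sequence: 100 consecutive heads.
p_all_heads = p ** n

# Any one specific sequence with exactly 60 heads and 40 tails.
p_one_typical_seq = p ** k * (1 - p) ** (n - k)

print(f"P(all heads)          = {p_all_heads:.2e}")       # ~6.5e-23
print(f"P(one 60-head seq)    = {p_one_typical_seq:.2e}") # ~5.9e-30, much smaller

# But there are C(100, 60) ~ 1.4e28 such sequences, so the
# total probability of seeing exactly 60 heads is substantial.
p_exactly_k = math.comb(n, k) * p_one_typical_seq
print(f"P(exactly 60 heads)   = {p_exactly_k:.3f}")       # ~0.081

# A simulated run lands near 60 heads, essentially never at 100.
random.seed(0)
heads = sum(random.random() < p for _ in range(n))
print(f"One simulated run: {heads} heads")
```

So the most likely individual outcome (all heads) is wildly atypical, while the typical outcome (about 60 heads) is made up of individually unlikely sequences.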