This post is aimed at readers who are already familiar with stochastic gradient descent (SGD) and terms like “batch size”. For an introduction to these ideas, I recommend Goodfellow et al.’s Deep Learning, in particular the introduction and, for more on SGD, Chapter 8. SGD is relevant here because it has made it feasible to train much more complex models than was previously possible.