I’ve been fooling around with some natural language data from OPUS, the “open parallel corpus.” This contains many gigabytes of movie subtitles, UN documents and other text, much of it tagged by part of speech and aligned across multiple languages. In total, there’s over 50 GB of data, compressed. “50 GB, compressed” is an awkward quantity of data: It’s large enough that Pandas can’t suck it all into memory. It’s large enough that PostgreSQL stops being fun, and starts feeling like work.
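
To make the Pandas complaint concrete: you can still *stream* a file this size through Pandas in chunks, you just can’t hold it all at once. Here’s a minimal sketch of that workaround; the file name and column layout are made up for illustration, not taken from the actual OPUS exports.

```python
import pandas as pd

# Hypothetical single-language OPUS export: one tab-separated,
# gzip-compressed file of tokens and their part-of-speech tags.
PATH = "opus_sample.tsv.gz"

total_tokens = 0
# chunksize makes read_csv return an iterator of DataFrames,
# so only ~1M rows ever live in memory at a time.
for chunk in pd.read_csv(
    PATH,
    sep="\t",
    compression="gzip",        # Pandas decompresses on the fly
    chunksize=1_000_000,       # rows per chunk; tune to your RAM
    names=["token", "pos"],    # hypothetical columns
):
    total_tokens += len(chunk)

print(f"Counted {total_tokens} tokens without loading the whole file.")
```

This works fine for one pass over one file, but it breaks down the moment you want joins, random access, or cross-language alignment, which is exactly where the “awkward quantity” pain starts.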