Dataset

First of all, we need a dataset. We could use the Reddit API, but it limits how many posts you can retrieve. Luckily, you can find a dump of everything from Reddit at files.pushshift.io/reddit. Let’s download a few datasets:

    wget https://files.pushshift.io/reddit/submissions/RS_2020-02.zst
    wget https://files.pushshift.io/reddit/submissions/RS_2020-03.zst

Next, we need to read the data and select only the subreddits and columns we’re interested in. Every dataset takes a lot ...
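
As a rough illustration of the read-and-select step, here is a minimal sketch. It assumes the dumps are zstd-compressed files with one JSON object per line, and that the third-party `zstandard` package is installed. The function names, the subreddit list, and the column list are hypothetical placeholders, not the selection used in this post.

```python
import io
import json


def filter_submissions(lines, subreddits, columns):
    """Yield one dict per submission posted in `subreddits`, keeping only `columns`."""
    wanted = {name.lower() for name in subreddits}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            post = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(post, dict):
            continue
        if str(post.get("subreddit", "")).lower() in wanted:
            # Missing columns come back as None rather than raising
            yield {col: post.get(col) for col in columns}


def read_zst_lines(path):
    """Stream decoded text lines from a .zst file without unpacking it to disk."""
    import zstandard  # third-party dependency: pip install zstandard

    with open(path, "rb") as fh:
        # Assumption: a large window size, since some dumps are compressed
        # with a long window that the default decompressor settings reject.
        dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
        reader = dctx.stream_reader(fh)
        yield from io.TextIOWrapper(reader, encoding="utf-8")


# Hypothetical selection; replace with the subreddits and columns you care about
SUBREDDITS = ["datascience", "python"]
COLUMNS = ["id", "subreddit", "title", "score", "created_utc"]
```

With the files downloaded above, something like `filter_submissions(read_zst_lines("RS_2020-02.zst"), SUBREDDITS, COLUMNS)` would stream the matching rows one at a time, so the full dump never has to sit in memory.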