How should we study datasets in machine learning? As machine learning (ML) increasingly becomes a site of sociotechnical inquiry, invoking numerous social, political, legal, and ethical issues, datasets are a crucial component as they are core material used to train models. Inspired by Tarleton Gillespie and Nick Seaver’s Critical Algorithm Studies reading list, this collection is meant to serve as an entry point to the growing literature on ML datasets across the fields of computer science...| knowingmachines.org
LAION-5B is an open-source foundation dataset. It contains 5.8 billion image and text pairs—a size too large to make sense of. We follow the construction of the dataset to better understand its contents, implications and entanglements.| knowingmachines.org