I recently proposed a new research project: let’s take all of GitHub (or <insert your preferred VCS host>) and build a multi-language (even partially language-agnostic) concrete syntax tree of all the code, so that we can do further research that would otherwise be impossibly difficult and answer incredibly complex questions.