In this article, we are going to delve into the second part of lexing: tokenizing the more advanced parts of the input. More precisely, we are going to lex spaces (whitespace, newlines, ...), comments, identifiers, numbers (int and float), and strings. By the end of this article, we will have a fully functioning lexer that can tokenize descriptor.proto, the longest proto file in the protobuf repo. Let's get started.
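To set the stage, here is a minimal sketch of the token categories we will be producing in this article. The names (`TokenKindSpace`, `TokenKindComment`, and so on) and the `Token` struct are illustrative assumptions for this overview, not necessarily the identifiers used in the actual lexer:

```go
package lexer

// TokenKind enumerates the token categories covered in this article.
// NOTE: these names are assumptions for illustration; the lexer's
// actual identifiers may differ.
type TokenKind int

const (
	TokenKindSpace      TokenKind = iota // whitespace, newlines, tabs, ...
	TokenKindComment                     // // line and /* block */ comments
	TokenKindIdentifier                  // e.g. syntax, message, FieldDescriptorProto
	TokenKindInt                         // e.g. 42
	TokenKindFloat                       // e.g. 3.14
	TokenKindString                      // e.g. "proto3"
)

// Token pairs a kind with the literal text it was lexed from.
type Token struct {
	Kind    TokenKind
	Literal string
}
```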