Introduction

Multimodal AI is changing how we interact with large language models. In the beginning, we typed in text and got a response. Now we can upload multiple types of files to an LLM and have it parse them. Blending natural language processing and computer vision, these models can interpret text, analyze images, and make recommendations. Until recently, multimodal AI was limited to hosted solutions, the “big name” tools: services like ChatGPT, Claude, Bard, and many others.