This post was also published on the Software Mansion blog. Large Language Models like Google’s Gemini 2.0 Flash and OpenAI’s GPT-4o Realtime are multimodal: users can chat with them via text, speak with them as in a natural conversation, or even send them a live video feed.