With last week’s Pixtral release, multimodal large language models (LLMs) like OpenAI’s GPT-4o, Google’s Gemini Pro, and Pixtral are making significant strides. These models can not only generate text from images but are often touted for their ability to “see” images much as humans do. But how true is this claim?

Update (2024-09-18): I added a new test for Qwen2-VL, which still fails to generate the correct ASCII art.