Visual Abilities of Language Models Found Lacking Depth

Researchers from Auburn University and the University of Alberta have found that the visual skills of large language models (LLMs) with vision capabilities—commonly called vision language models (VLMs)—have been overstated. Their study, recently posted on arXiv, tested popular VLMs including GPT-4o, Gemini-1.5 Pro, and Claude-3 models, revealing significant limitations in their visual processing abilities.

Key Findings

The study highlighted that while VLMs can accept visual input, their ability to interpret and analyze that input remains rudimentary. For instance, the models can identify objects in images but struggle with more complex tasks such as counting those objects or reasoning about the spatial relationships between them.

One example illustrates the shortcoming: when asked how many children in front of the Taj Mahal were holding hands, the models failed to give accurate answers. Their training has emphasized recognizing objects over tasks that require counting or analyzing how those objects interact.

Performance on Visual Tasks

The researchers tasked the VLMs with counting overlapping circles and interlocking rings—tasks that are relatively straightforward for humans. The models performed poorly unless they had previously encountered similar visual examples. For instance, they could count five interlocking rings, likely because the Olympic logo is common in training data, but their accuracy dropped sharply once the number of rings went beyond five.
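
To make the setup concrete, here is a minimal sketch of the kind of test described above—not the authors' actual benchmark code. It renders a handful of overlapping circles and asks a vision-capable model to count them. The drawing parameters, the prompt wording, and the use of the OpenAI client with gpt-4o are illustrative assumptions.

```python
# Minimal sketch (assumed setup, not the study's benchmark code): draw N
# overlapping circles, then ask a vision-capable model to count them.
# Requires Pillow, the OpenAI Python SDK, and an OPENAI_API_KEY in the environment.
import base64
import io
import random

from PIL import Image, ImageDraw
from openai import OpenAI


def draw_overlapping_circles(n: int, size: int = 512, radius: int = 70) -> Image.Image:
    """Render n circle outlines on a white canvas, placed so they often overlap."""
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    for _ in range(n):
        # Random centres within the canvas; with a shared radius, circles frequently intersect.
        cx = random.randint(radius, size - radius)
        cy = random.randint(radius, size - radius)
        draw.ellipse(
            (cx - radius, cy - radius, cx + radius, cy + radius),
            outline="black",
            width=4,
        )
    return img


def ask_vlm_to_count(img: Image.Image, model: str = "gpt-4o") -> str:
    """Send the image to a vision-capable chat model and ask for a count."""
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()

    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "How many circles are in this image? Answer with a single number."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            }
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    n_circles = 7  # counts above five are where the study reports accuracy dropping
    answer = ask_vlm_to_count(draw_overlapping_circles(n_circles))
    print(f"Ground truth: {n_circles}, model answer: {answer}")
```

Comparing the model's answer against the known ground truth across many randomly generated images gives a simple accuracy measure of the kind the study reports.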

This research suggests that, despite rapid advances, VLMs still fall well short of processing visual information the way human perception does. The findings point to a need for further development of the visual processing components of these models to improve their utility in real-world applications.
