Ever wondered if AI models like GPT-4o and Gemini-1.5 Pro can really see?
Naomi | Beeyond AI
Are Vision Language Models (VLMs) Actually Blind?
Turns out, they might not be as good at seeing as we thought.
A research paper from the University of Alberta reveals some surprising flaws in Vision Language Models (VLMs) like GPT-4o and Gemini-1.5 Pro. Despite their high scores on standard benchmarks, these models struggle with simple visual tasks such as spotting overlapping shapes, counting objects, and recognizing circled letters, which suggests they don't really process visual information the way we do.
Check out the full details and experiments in the research paper here: https://vlmsareblind.github.io/
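If you want to poke at this yourself, here's a minimal sketch of the kind of test the paper describes: render two circles and ask a VLM whether they overlap. The model name (gpt-4o) comes from the post; the prompt wording, the matplotlib rendering, the circle sizes, and the use of the OpenAI Python SDK are my own assumptions, not the paper's exact setup.

```python
# Minimal sketch (assumes the openai and matplotlib packages plus an OPENAI_API_KEY
# in the environment; prompt and geometry are illustrative, not the paper's protocol).
import base64
import io

import matplotlib.pyplot as plt
from openai import OpenAI


def make_circles_image(distance: float) -> bytes:
    """Render two circles of radius 0.2 whose centers are `distance` apart; return PNG bytes."""
    fig, ax = plt.subplots(figsize=(3, 3))
    ax.add_patch(plt.Circle((0.35, 0.5), 0.2, fill=False, linewidth=2))
    ax.add_patch(plt.Circle((0.35 + distance, 0.5), 0.2, fill=False, linewidth=2))
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.axis("off")
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)
    return buf.getvalue()


client = OpenAI()
png = make_circles_image(distance=0.38)  # less than 2 * radius, so the circles overlap
b64 = base64.b64encode(png).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Do the two circles in this image overlap? Answer yes or no."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

Try sweeping `distance` around 2 * radius; if the paper's findings hold, answers tend to get shakier as the circles approach the point of just touching.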
What other tasks do you think VLMs or LLMs might struggle with?
Replies
Juntaro Matsumoto@juntaro_matsumoto
As a Stable Diffusion freak, I constantly see vision and diffusion models struggle to capture human hands and fingers, both in generation and in detection by transformer models. Even a simple request like "count the number of fingers you see in this image" doesn't work out.
@juntaro_matsumoto Indeed, it's a long way before VLMs can begin to process images the way humans do.