Ever wondered if AI models like GPT-4o and Gemini-1.5 Pro can really see?
Naomi | Beeyond AI
Are Vision Language Models (VLMs) Actually Blind?
Turns out, they might not be as good at seeing as we thought.
A research paper from the University of Alberta reveals some surprising flaws in Vision Language Models (VLMs) like GPT-4o and Gemini-1.5 Pro. Despite their high scores on standard benchmarks, these models struggle with simple visual tasks such as spotting overlapping shapes, counting objects, and recognizing circled letters, which suggests they don't really process visual information the way we do.
Check out the full details and experiments in the research paper here: https://vlmsareblind.github.io/
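If you want to poke at this yourself, here's a minimal sketch of the kind of test the paper describes: render two circles and ask a VLM whether they overlap. The model name (gpt-4o) comes from the post; the prompt wording, the matplotlib rendering, the circle sizes, and the use of the OpenAI Python SDK are my own assumptions, not the paper's exact setup.

```python
# Minimal sketch (assumes the openai and matplotlib packages plus an OPENAI_API_KEY
# in the environment; prompt and geometry are illustrative, not the paper's protocol).
import base64
import io

import matplotlib.pyplot as plt
from openai import OpenAI


def make_circles_image(distance: float) -> bytes:
    """Render two circles of radius 0.2 whose centers are `distance` apart; return PNG bytes."""
    fig, ax = plt.subplots(figsize=(3, 3))
    ax.add_patch(plt.Circle((0.35, 0.5), 0.2, fill=False, linewidth=2))
    ax.add_patch(plt.Circle((0.35 + distance, 0.5), 0.2, fill=False, linewidth=2))
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.axis("off")
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)
    return buf.getvalue()


client = OpenAI()
png = make_circles_image(distance=0.38)  # less than 2 * radius, so the circles overlap
b64 = base64.b64encode(png).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Do the two circles in this image overlap? Answer yes or no."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

Try sweeping `distance` around 2 * radius; if the paper's findings hold, answers tend to get shakier as the circles approach the point of just touching.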
What other tasks do you think VLMs or LLMs might struggle with?
Replies
Juntaro Matsumoto@juntaro_matsumoto
As a Stable Diffusion freak, I constantly see vision and diffusion models struggle to capture human hands and fingers, both in generation and in detection by transformer models. Even a simple request like "count the number of fingers you see in this image" doesn't work out.
@juntaro_matsumoto Indeed, it's a long way before VLMs can begin to process images the way humans do.