

Poster

ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models

Rohan Wadhawan · Hritik Bansal · Kai-Wei Chang · Nanyun Peng


Abstract:

In the real world, many tasks require joint reasoning over the text and visual elements in an image (e.g., navigating public spaces), which we refer to as context-sensitive text-rich visual reasoning. Specifically, these tasks require an understanding of the context in which the text interacts with visual elements within an image. Due to the lack of existing datasets for this task, we introduce ConTextual, a novel dataset featuring human-crafted instructions that require context-sensitive reasoning over text-rich images. We conduct experiments to assess the performance of 13 foundation models (including GPT-4V, Gemini-Pro-Vision, and LLaVA-1.5) and establish a human performance baseline. Further, we perform a human evaluation of the model responses and observe a significant performance gap of 30.8% between the best-performing LMM, GPT-4V, and the human performance baseline. Our fine-grained analysis reveals that GPT-4V encounters difficulties interpreting time-related data and infographics. However, it demonstrates proficiency in comprehending abstract visual contexts such as memes and quotes. Finally, our qualitative analysis uncovers various factors contributing to poor performance, including imprecise visual perception and hallucinations. We will release the dataset and code upon acceptance.
