VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models

Xiamen University
*Equal Contribution Corresponding Author

Abstract

The swift progress of Multi-modal Large Models (MLLMs) has showcased their impressive ability to tackle tasks blending vision and language. Yet, most current models and benchmarks cater to scenarios with a narrow scope of visual and textual contexts. These models often fall short when faced with complex comprehension tasks, which involve navigating through a plethora of irrelevant and potentially misleading information in both text and image forms. To bridge this gap, we introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC). This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions and to follow intricate instructions to pinpoint the relevant image. In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA), to refine image-text correlation skills. Our evaluation of four leading closed-source models, as well as various open-source models, on VEGA underscores the rigorous nature of IITC. Even the most advanced models, such as Gemini-1.5-pro and GPT-4V, achieved only modest success. By employing a multi-task, multi-scale post-training strategy, we have set a robust baseline for MLLMs on the IITC task, attaining an 85.8% accuracy rate in image association and a 0.508 Rouge score. These results validate the effectiveness of our dataset in improving MLLMs' capabilities for nuanced image-text comprehension.

Introduction


Comparison between existing VQA tasks and our IITC task.

  • Left: The input for existing VQA tasks incorporates only a limited amount of image and text data, all of which is highly relevant to the question.
  • Right: The input for the IITC task contains much longer interleaved image and text content, including redundant and misleading information. The model must specify the image it refers to when providing an answer.

To evaluate and enhance model performance on the IITC task, we created the VEGA dataset, which focuses on the comprehension of scientific papers and contains over 50,000 scientific articles. The dataset is structured into two subsets, curated to train models on the IITC and ITA tasks, respectively; the ITA task is a subtask designed to support training for IITC. We fine-tuned the Qwen-VL-Chat model on VEGA using a multi-scale, multi-task training strategy, resulting in the VEGA-Base model. Our experiments show that this model achieves an image association accuracy of 85.8%, a substantial improvement that establishes a strong baseline for the IITC task.
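For concreteness, the snippet below is a minimal sketch of what such a multi-scale, multi-task training mixture could look like, assuming it amounts to sampling IITC and ITA examples, truncating them to one of several context-length scales, and interleaving them in a single fine-tuning stream. The subset names, mixing ratio, token budgets, and helper functions are illustrative assumptions, not the released training code.

```python
# A minimal sketch of a multi-scale, multi-task sampling loop. Subset names,
# the mixing ratio, token budgets, and the per-image token cost below are
# illustrative assumptions.
import random

TASK_MIX = [("iitc", 0.7), ("ita", 0.3)]   # assumed task sampling ratio
CONTEXT_SCALES = [2048, 4096, 8192]        # assumed token budgets ("scales")


def truncate_to_budget(example, budget, tokens_per_image=256):
    """Greedily keep leading segments until the rough token budget is hit."""
    kept, used = [], 0
    for seg in example["segments"]:
        cost = tokens_per_image if seg["type"] == "image" else len(seg["content"].split())
        if used + cost > budget:
            break
        kept.append(seg)
        used += cost
    return {**example, "segments": kept}


def sample_training_example(subsets):
    """Pick a task, then a context scale, then truncate one raw example.

    `subsets` maps a task name ("iitc" / "ita") to a list of raw examples,
    each holding interleaved text/image items under a "segments" key.
    """
    names = [name for name, _ in TASK_MIX]
    weights = [weight for _, weight in TASK_MIX]
    task = random.choices(names, weights=weights, k=1)[0]
    example = random.choice(subsets[task])
    budget = random.choice(CONTEXT_SCALES)
    return truncate_to_budget(example, budget)
```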

Definitions of the IITC and ITA tasks. (a) The IITC task takes long interleaved image-text content as input and requires the model to specify the image it refers to in its response. (b) The ITA task takes shuffled images and text segments from different articles as input and requires the model to output the correspondence between the text segments and the images. <Text *> and <Image *> denote a text segment and an image, respectively; both are tokenized and fed into the model along with the task prompt and the question.
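To make this input layout concrete, here is a hedged sketch of how an interleaved IITC prompt could be assembled from <Text *> and <Image *> markers before tokenization. The marker wording, the task prompt, and the build_iitc_prompt helper are illustrative assumptions rather than the dataset's released preprocessing code.

```python
# Hypothetical assembly of an interleaved IITC prompt. Image bytes are handled
# by the model's vision front end; here we only lay out the markers in
# document order and append the task prompt and question.
from typing import Dict, List


def build_iitc_prompt(segments: List[Dict], question: str) -> str:
    """Interleave text segments and image placeholders in document order.

    Each segment is a dict such as {"type": "text", "content": "..."} or
    {"type": "image", "path": "figure3.png"}.
    """
    parts = []
    text_idx, image_idx = 1, 1
    for seg in segments:
        if seg["type"] == "text":
            parts.append(f"<Text {text_idx}> {seg['content']}")
            text_idx += 1
        else:
            parts.append(f"<Image {image_idx}>")
            image_idx += 1

    task_prompt = (
        "Answer the question based on the article above and state which "
        "image your answer refers to."
    )
    return "\n".join(parts) + f"\n{task_prompt}\nQuestion: {question}"
```

For the ITA subtask, the same marker layout would apply, except that the text segments and images drawn from different articles are shuffled and the expected output is the text-image correspondence rather than a grounded answer.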

Examples from the VEGA Dataset

All papers in VEGA are sourced from arXiv.

IITC Case 1
IITC Case 2
ITA Case 1

Specific Cases of VEGA-Base


IITC Case 1
IITC Case 2

BibTeX


TODO