The rapid progress of Multi-modal Large Language Models (MLLMs) has showcased their impressive ability to tackle tasks that blend vision and language. Yet most current models and benchmarks cater to scenarios with a narrow scope of visual and textual context. These models often fall short on complex comprehension tasks that require navigating a large amount of irrelevant and potentially misleading information in both text and images. To bridge this gap, we introduce a new, more demanding task, Interleaved Image-Text Comprehension (IITC). It challenges models to discern and disregard superfluous elements in both images and text in order to answer questions accurately, and to follow intricate instructions to pinpoint the relevant image. In support of this task, we construct the VEGA dataset, tailored to IITC on scientific content, and devise a subtask, Image-Text Association (ITA), to refine image-text correlation skills. Our evaluation of four leading closed-source models, as well as various open-source models, on VEGA underscores the rigorous nature of IITC: even the most advanced models, such as Gemini-1.5-Pro and GPT-4V, achieve only modest success. By employing a multi-task, multi-scale post-training strategy, we set a robust baseline for MLLMs on the IITC task, attaining 85.8% image association accuracy and a 0.508 ROUGE score. These results validate the effectiveness of our dataset in improving MLLMs' capacity for nuanced image-text comprehension.
Comparison between existing VQA tasks and our IITC task.
To evaluate and enhance model performance on the IITC task, we have created the VEGA dataset, which focuses on the comprehension of scientific papers and contains over 50,000 scientific articles. The dataset is structured into two subsets, curated to train models on the IITC and ITA tasks respectively, with ITA serving as a subtask that supports training for IITC. We fine-tuned the Qwen-VL-Chat model on VEGA using a multi-scale, multi-task training strategy, yielding the VEGA-Base model. Our experiments show that this model reaches an image association accuracy of 85.8%, establishing a strong baseline for the IITC task.
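As a rough illustration of what a multi-scale, multi-task schedule can look like, the sketch below mixes IITC and ITA samples and assigns each batch one of several maximum context lengths. The sampling ratio, length buckets, and field names are assumptions for illustration, not the exact recipe used to train VEGA-Base.

```python
import random

# Hypothetical placeholder samples; real VEGA records would contain interleaved
# text segments, image references, a question, and a grounded answer.
iitc_samples = [{"task": "IITC", "id": i} for i in range(100)]
ita_samples = [{"task": "ITA", "id": i} for i in range(100)]

# Assumed length buckets for the "multi-scale" part of the schedule.
MAX_LENGTHS = [1024, 2048, 4096]

def sample_batch(batch_size=4, iitc_ratio=0.5):
    """Draw a mixed IITC/ITA batch and pick a max token length for it."""
    max_len = random.choice(MAX_LENGTHS)
    batch = []
    for _ in range(batch_size):
        pool = iitc_samples if random.random() < iitc_ratio else ita_samples
        sample = dict(random.choice(pool))
        sample["max_length"] = max_len  # inputs would be truncated/packed to this length
        batch.append(sample)
    return batch

print(sample_batch())
```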
Task definitions of IITC and ITA. (a) The IITC task takes long interleaved image-text content as input and requires the model to specify, in its response, the image it refers to. (b) The ITA task takes shuffled images and text segments drawn from different articles as input and requires the model to output the associations between the text segments and the images. <Text *> and <Image *> denote a text segment and an image, respectively; both are tokenized and fed into the model along with the task prompt and the question.
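To make the input format above concrete, here is a minimal sketch of assembling an IITC-style interleaved prompt from text segments and image placeholders. The marker wording, instruction text, and example content are assumptions for illustration; in the actual pipeline the image placeholders are replaced by vision tokens from the model's image encoder.

```python
def build_iitc_prompt(segments, question):
    """Interleave text segments and image placeholders, then append the question.

    `segments` is a list of ("text", str) or ("image", path) tuples.
    """
    parts = []
    text_idx, image_idx = 0, 0
    for kind, content in segments:
        if kind == "text":
            text_idx += 1
            parts.append(f"<Text {text_idx}>: {content}")
        else:  # image
            image_idx += 1
            # A real pipeline would replace this placeholder with vision tokens.
            parts.append(f"<Image {image_idx}>: {content}")
    parts.append(
        "Answer the question based only on the relevant image and text, "
        "and state which image your answer refers to.\n"
        f"Question: {question}"
    )
    return "\n".join(parts)


# Example usage with dummy content:
prompt = build_iitc_prompt(
    [("text", "Section 3 describes the ablation study ..."),
     ("image", "figure_2.png"),
     ("text", "Unrelated appendix text ..."),
     ("image", "figure_5.png")],
    "Which setting achieves the best accuracy in the ablation study?",
)
print(prompt)
```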
All papers in VEGA are sourced from arXiv.
VEGA-8k* is a proprietary model trained on VEGA data with an 8k token length, together with some in-house interleaved image-text data. This in-house data covers a wider range of image-text application scenarios, giving VEGA-8k* a more general document-understanding capability. VEGA-Base-4k is derived from fine-tuning the Qwen-VL-Chat 7B model on the VEGA dataset with a 4k token length.
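For reference, below is a minimal sketch of querying the base Qwen-VL-Chat model (the starting point for VEGA-Base-4k) through its published chat interface. The image path and question are placeholders, and the VEGA fine-tuning itself (4k token length, multi-scale multi-task schedule) is not reproduced here.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model that VEGA-Base-4k is fine-tuned from.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True
).eval()

# Build an interleaved image-text query (placeholder image path and question).
query = tokenizer.from_list_format([
    {"image": "figure_2.png"},
    {"text": "Which image does the answer refer to, and what does it show?"},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```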
@misc{zhou2024vegalearninginterleavedimagetext,
title={VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models},
author={Chenyu Zhou and Mengdan Zhang and Peixian Chen and Chaoyou Fu and Yunhang Shen and Xiawu Zheng and Xing Sun and Rongrong Ji},
year={2024},
eprint={2406.10228},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2406.10228},
}