MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning

Zhao, Haozhe; Cai, Zefan; Si, Shuzheng; Ma, Xiaojian; An, Kaikai; Chen, Liang; Liu, Zixuan; Wang, Sheng; Han, Wenjuan; Chang, Baobao

arXiv:2309.07915 (cs)

[Submitted on 14 Sep 2023 (v1), last revised 20 Mar 2024 (this version, v3)]

Abstract: Since the resurgence of deep learning, vision-language models (VLMs) enhanced by large language models (LLMs) have grown exponentially in popularity. However, while LLMs can utilize extensive background knowledge and task information through in-context learning, most VLMs still struggle to understand complex multi-modal prompts with multiple images, making them less effective in downstream vision-language tasks. In this paper, we address this limitation by 1) introducing the vision-language Model with Multi-Modal In-Context Learning (MMICL), a new approach that allows the VLM to handle multi-modal inputs efficiently; 2) proposing a novel context scheme to augment the in-context learning ability of the VLM; 3) constructing the Multi-modal In-Context Learning (MIC) dataset, designed to enhance the VLM's ability to understand complex multi-modal prompts. Our experiments confirm that MMICL achieves new state-of-the-art zero-shot performance on a wide range of general vision-language tasks, especially on complex benchmarks, including MME and MMBench. Our analysis demonstrates that MMICL effectively tackles the challenge of complex multi-modal prompt understanding and exhibits impressive ICL ability. Furthermore, we observe that MMICL successfully alleviates language bias in VLMs, a common issue that often leads to hallucination when VLMs are faced with extensive textual context. Our code, dataset, dataset tool, and model are available at this https URL

Comments:	Accepted by ICLR 2024
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2309.07915 [cs.CL]
	(or arXiv:2309.07915v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2309.07915

Submission history

From: HaoZhe Zhao
[v1] Thu, 14 Sep 2023 17:59:17 UTC (17,919 KB)
[v2] Mon, 2 Oct 2023 14:46:01 UTC (40,125 KB)
[v3] Wed, 20 Mar 2024 16:17:02 UTC (43,479 KB)
