Introducing KOSMOS-2, a Multimodal Large Language Model (MLLM) built on top of KOSMOS-1 that adds new capabilities for perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world.

- In addition to the existing capabilities of MLLMs (e.g., perceiving general modalities, following instructions, and performing in-context learning), KOSMOS-2 brings multimodal grounding and referring capabilities that can be integrated into downstream applications (a short usage sketch follows below).
- KOSMOS-2 lays the foundation for the development of Embodiment AI and sheds light on the big convergence of language, multimodal perception, action, and world modeling, which is a key step toward artificial general intelligence.

More information about our research: https://aka.ms/GeneralAI
https://lnkd.in/dzqtehN6
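
For those who want to experiment with the grounding capability, below is a minimal sketch using the Hugging Face transformers Vision2Seq API. It assumes the publicly released checkpoint "microsoft/kosmos-2-patch14-224" and follows that model card's prompt and post-processing conventions; the image URL is a placeholder, and details may differ from the original KOSMOS-2 release.

import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Assumption: the Hugging Face checkpoint "microsoft/kosmos-2-patch14-224" is available.
processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")
model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224")

# Any RGB image works; this URL is only an illustrative placeholder.
image = Image.open(requests.get("https://example.com/snowman.png", stream=True).raw)

# The "<grounding>" tag asks the model to ground generated phrases to image regions.
prompt = "<grounding>An image of"
inputs = processor(text=prompt, images=image, return_tensors="pt")

generated_ids = model.generate(
    pixel_values=inputs["pixel_values"],
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    image_embeds_position_mask=inputs["image_embeds_position_mask"],
    max_new_tokens=64,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# post_process_generation strips the location tokens and returns the cleaned caption
# plus a list of grounded entities: (phrase, character span, normalized bounding boxes).
caption, entities = processor.post_process_generation(generated_text)
print(caption)
print(entities)  # e.g., [("a snowman", (12, 21), [(x1, y1, x2, y2)]), ...]

The returned bounding boxes are normalized to [0, 1], so they can be scaled by the image width and height to draw the grounded regions in a downstream application.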