Introducing KOSMOS-2, a Multimodal Large Language Model (MLLM) built on top of KOSMOS-1 that adds new capabilities for perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world.

- In addition to the existing capabilities of MLLMs (e.g., perceiving general modalities, following instructions, and performing in-context learning), KOSMOS-2 brings multimodal grounding and referring capabilities that can be integrated into downstream applications (a short usage sketch follows below).
- KOSMOS-2 lays the foundation for the development of Embodiment AI and sheds light on the big convergence of language, multimodal perception, action, and world modeling, which is a key step toward artificial general intelligence.

More information about our research: https://aka.ms/GeneralAI
https://lnkd.in/dzqtehN6
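
For those who want to experiment with the grounding capability, below is a minimal sketch using the Hugging Face transformers Vision2Seq API. It assumes the publicly released checkpoint "microsoft/kosmos-2-patch14-224" and follows that model card's prompt and post-processing conventions; the image URL is a placeholder, and details may differ from the original KOSMOS-2 release.

import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Assumption: the Hugging Face checkpoint "microsoft/kosmos-2-patch14-224" is available.
processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")
model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224")

# Any RGB image works; this URL is only an illustrative placeholder.
image = Image.open(requests.get("https://example.com/snowman.png", stream=True).raw)

# The "<grounding>" tag asks the model to ground generated phrases to image regions.
prompt = "<grounding>An image of"
inputs = processor(text=prompt, images=image, return_tensors="pt")

generated_ids = model.generate(
    pixel_values=inputs["pixel_values"],
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    image_embeds_position_mask=inputs["image_embeds_position_mask"],
    max_new_tokens=64,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# post_process_generation strips the location tokens and returns the cleaned caption
# plus a list of grounded entities: (phrase, character span, normalized bounding boxes).
caption, entities = processor.post_process_generation(generated_text)
print(caption)
print(entities)  # e.g., [("a snowman", (12, 21), [(x1, y1, x2, y2)]), ...]

The returned bounding boxes are normalized to [0, 1], so they can be scaled by the image width and height to draw the grounded regions in a downstream application.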