In this episode, we discuss LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding by Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J. Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny, Vikas Chandra. LongVU presents a spatiotemporal adaptive compression method for processing long videos with multimodal large language models, efficiently reducing redundancy while preserving important visual information. It employs techniques such as cross-modal queries, DINOv2 features, and token reduction to manage spatial and temporal information. This approach achieves superior performance on video understanding benchmarks, handles lengthy videos well, and remains effective when scaled down to smaller models.
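For intuition, here is a minimal sketch of the temporal-reduction step described above: drop frames whose features are nearly identical to the last kept frame. It assumes per-frame feature vectors (e.g., DINOv2 embeddings) are already computed, and the similarity threshold is illustrative rather than the paper's exact setting.

```python
import numpy as np

def reduce_redundant_frames(frame_feats: np.ndarray, sim_threshold: float = 0.9):
    """Keep a frame only if it is sufficiently dissimilar to the last kept frame.

    frame_feats: (num_frames, dim) array of per-frame features (e.g., DINOv2 embeddings).
    Returns the indices of frames that survive temporal reduction.
    """
    # L2-normalize so dot products are cosine similarities.
    feats = frame_feats / (np.linalg.norm(frame_feats, axis=1, keepdims=True) + 1e-8)
    kept = [0]  # always keep the first frame
    for t in range(1, len(feats)):
        if float(feats[t] @ feats[kept[-1]]) < sim_threshold:
            kept.append(t)
    return kept

# Example: 8 random "frames"; frame 1 is a near-duplicate of frame 0 and should be dropped.
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 32))
feats[1] = feats[0] + 0.01 * rng.normal(size=32)
print(reduce_redundant_frames(feats))
```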
Ramin Mehran’s Post
More Relevant Posts
-
In this episode, we discuss Evaluating Text-to-Visual Generation with Image-to-Text Generation by Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, Deva Ramanan. The paper introduces VQAScore, a novel metric for evaluating the alignment of generated images to text prompts, utilizing a visual-question-answering model to score the relevance of images to prompts based on a simple yes-or-no question. Unlike existing metrics, the proposed VQAScore effectively handles complex prompts, demonstrating superior performance across numerous benchmarks, even when compared to proprietary models like GPT-4V. Additionally, the paper presents GenAI-Bench, a challenging new benchmark consisting of compositional text prompts and human ratings, and provides open-source access to their data and models to facilitate further research in text-to-visual generation evaluations.
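As a hedged sketch of the yes/no scoring recipe in the summary: ask a VQA model whether the image shows the prompt, and use its probability of answering "Yes" as the alignment score. The `yes_probability` callable and the question template are placeholders for whatever VQA model and prompt you plug in, not an API from the paper.

```python
from typing import Callable

def vqa_score(image_path: str,
              prompt: str,
              yes_probability: Callable[[str, str], float]) -> float:
    """VQAScore-style alignment: the probability that a VQA model answers 'Yes'
    when asked whether the image shows the prompt.

    yes_probability(image_path, question) -> P("Yes") is assumed to be provided
    by an off-the-shelf visual-question-answering model.
    """
    question = f'Does this figure show "{prompt}"? Please answer yes or no.'
    return yes_probability(image_path, question)

# Usage with a dummy model that always returns 0.5:
print(vqa_score("cat.png", "a cat wearing a hat", lambda img, q: 0.5))
```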
arxiv preprint - Evaluating Text-to-Visual Generation with Image-to-Text Generation (podbean.com)
-
In this episode, we discuss Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution by Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin. The Qwen2-VL Series introduces Naive Dynamic Resolution for processing images of varying resolutions more efficiently and integrates Multimodal Rotary Position Embedding for improved fusion of positional information across modalities. It employs a unified approach to processing both images and videos, enhancing visual perception, and explores scaling laws for large vision-language models by increasing model size and training data. The Qwen2-VL-72B model achieves competitive performance, rivaling top models like GPT-4o and Claude 3.5 Sonnet, and surpasses other generalist models across various benchmarks.
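A rough sketch of the dynamic-resolution bookkeeping mentioned above: the number of visual tokens grows with the input resolution instead of being fixed. The 14-pixel patch size and 2x2 token merge below are commonly cited Qwen2-VL settings, but treat the exact numbers as assumptions for illustration.

```python
import math

def visual_token_count(height: int, width: int,
                       patch: int = 14, merge: int = 2) -> int:
    """Estimate how many visual tokens a dynamic-resolution ViT front end produces:
    one token per patch, then merged in merge x merge groups."""
    patches_h = math.ceil(height / patch)
    patches_w = math.ceil(width / patch)
    return (patches_h * patches_w) // (merge * merge)

# Larger images naturally cost more tokens instead of being resized to a fixed grid.
for h, w in [(224, 224), (448, 448), (1024, 768)]:
    print((h, w), "->", visual_token_count(h, w), "tokens")
```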
Arxiv Paper - Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution (podbean.com)
-
In this episode, we discuss NVLM: Open Frontier-Class Multimodal LLMs by Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping. The paper introduces NVLM 1.0, a set of advanced multimodal large language models that achieve state-of-the-art performance on vision-language tasks and improve upon their text-only capabilities. It outlines the benefits of a novel architecture that enhances training efficiency and reasoning abilities using a 1-D tile-tagging design, emphasizing the importance of dataset quality and task diversity over scale. NVLM 1.0's models excel in multimodal and text-only tasks through the integration of high-quality data, and the model weights are released with plans to open-source the training code.
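To make the 1-D tile-tagging idea concrete, here is a small sketch in which each image tile's visual tokens are prefixed with a text tag so the language model knows which tile it is reading; the tag strings are hypothetical, not NVLM's exact tokens.

```python
from typing import List

def tag_tiles(tile_tokens: List[List[str]]) -> List[str]:
    """Flatten per-tile token lists into one sequence, inserting a 1-D text tag
    (<tile_global>, <tile_1>, <tile_2>, ...) before each tile's visual tokens."""
    sequence: List[str] = []
    for i, tokens in enumerate(tile_tokens):
        tag = "<tile_global>" if i == 0 else f"<tile_{i}>"
        sequence.append(tag)
        sequence.extend(tokens)
    return sequence

# Example with a global thumbnail tile plus two high-resolution tiles.
print(tag_tiles([["v0a", "v0b"], ["v1a", "v1b"], ["v2a", "v2b"]]))
```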
Arxiv Paper - NVLM: Open Frontier-Class Multimodal LLMs (podbean.com)
-
Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models Li et al.: https://lnkd.in/gTq7iSt4 #ArtificialIntelligence #DeepLearning #MachineLearning
-
Shedding new light on intrinsic image decomposition! MLI-NeRF leverages multi-light info in Neural Radiance Fields to enhance reflectance and shading separation without ground truth. Check out its performance on diverse scenes! 💡📷 #NeRF #MachineLearning #ComputerVision
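For background, the intrinsic-image model behind reflectance/shading separation is I = R ⊙ S per pixel, and under multiple lights the reflectance stays fixed while the shading changes. The toy snippet below only illustrates that multi-light consistency, not MLI-NeRF itself.

```python
import numpy as np

# Toy intrinsic-image model: image = reflectance * shading (per pixel, per channel).
rng = np.random.default_rng(1)
reflectance = rng.uniform(0.2, 1.0, size=(4, 4, 3))                   # light-invariant
shadings = [rng.uniform(0.1, 1.0, size=(4, 4, 1)) for _ in range(3)]  # one per light
images = [reflectance * s for s in shadings]

# Reflectance recovered under any light should agree, since only the shading changed.
recovered = [img / s for img, s in zip(images, shadings)]
print(np.allclose(recovered[0], recovered[1]), np.allclose(recovered[1], recovered[2]))
```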
-
Technology 🆚 Technology In light of the recent controversy involving OpenAI and Scarlett Johansson, where artists are increasingly concerned about their work being stolen or misused by AI technologies, the paper mentioned below takes a first step toward an effective solution. (paper link in the comment below) For artists worried about the unauthorised use and reproduction of their work, this paper provides a robust tool for protecting digital content. It's a fascinating read that combines theoretical insights with practical applications, making it highly relevant in today's digital age. Also, congratulations Mayank Kumar Singh on your recent publication. Do you think such technology can save artists from the technology stealing their work? 🤔 #OpenAIvScarlettJohansson #ScarlettJohansson #openai #privacy #watermarking #interspeech #artist OpenAI Sony
Our paper titled 🤫 SilentCipher: Deep Audio Watermarking (co-authored with Naoya Takahashi, Wei-Hsiang Liao and Yuki Mitsufuji, PhD) has been accepted at INTERSPEECH 2024. Please find the link to the open-sourced code and arXiv paper in the comments section! Summary: In this paper, we address the artefacts introduced by deep-learning-based watermarking methods and introduce a way to remove the need for perceptual losses, which leads to stable training and allows us to achieve SOTA in terms of both perceptual quality and robustness against distortion. Unlike previous methods, which work at a 16 kHz sampling rate, we also showcase our results at 44.1 kHz, opening the path for practical applications. SonyAI #interspeech2024 #sony #sonyai #audiowatermarking #watermark
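SilentCipher itself is a learned encoder/decoder, but for context here is a tiny classical spread-spectrum-style watermark in NumPy: a key-seeded pseudo-random carrier is added at low amplitude and detected by correlation. This is a generic baseline sketch, not the paper's method.

```python
import numpy as np

def embed_bit(audio: np.ndarray, bit: int, key: int, alpha: float = 0.005) -> np.ndarray:
    """Add a key-seeded pseudo-random carrier; its sign encodes one bit."""
    carrier = np.random.default_rng(key).standard_normal(audio.shape)
    return audio + alpha * (1 if bit else -1) * carrier

def detect_bit(audio: np.ndarray, key: int) -> int:
    """Correlate against the same carrier; positive correlation means bit = 1."""
    carrier = np.random.default_rng(key).standard_normal(audio.shape)
    return int(np.dot(audio, carrier) > 0)

sr = 44100
t = np.arange(sr) / sr
clean = 0.3 * np.sin(2 * np.pi * 440.0 * t)                            # 1 s of a 440 Hz tone
marked = embed_bit(clean, bit=1, key=42)
noisy = marked + 0.01 * np.random.default_rng(7).standard_normal(sr)   # mild distortion
print(detect_bit(noisy, key=42))                                       # expected: 1
```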
-
Object detection is a technique used to identify and locate objects within an image or video. YOLO 🏏🏏 (You Only Look Once) is a real-time object detection system that divides the image into a grid and predicts bounding boxes and class probabilities directly from full images in a single evaluation. 🚀🚀 I learned this from the AIMER Society - Artificial Intelligence Medical and Engineering Researchers Society and Sai Satish. #internship #objectdetection
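A minimal sketch of the grid-based decoding that single-shot detectors like YOLO use: each cell predicts box offsets, size, and a confidence, which are converted to image-space boxes in one pass. The tensor layout and threshold here are illustrative, not a specific YOLO version.

```python
import numpy as np

def decode_grid(preds: np.ndarray, conf_thresh: float = 0.5):
    """preds: (S, S, 5) = (x_off, y_off, w, h, confidence), all in [0, 1].
    Returns (x_center, y_center, w, h, conf) boxes in normalized image coordinates."""
    S = preds.shape[0]
    boxes = []
    for row in range(S):
        for col in range(S):
            x_off, y_off, w, h, conf = preds[row, col]
            if conf < conf_thresh:
                continue
            x_center = (col + x_off) / S   # cell-relative offset -> image coordinates
            y_center = (row + y_off) / S
            boxes.append((float(x_center), float(y_center), float(w), float(h), float(conf)))
    return boxes

# One confident detection near the middle of a 7x7 grid.
preds = np.zeros((7, 7, 5))
preds[3, 3] = [0.5, 0.5, 0.2, 0.3, 0.9]
print(decode_grid(preds))  # -> [(0.5, 0.5, 0.2, 0.3, 0.9)]
```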
-
In this episode, we discuss SciMON: Scientific Inspiration Machines Optimized for Novelty by Qingyun Wang, Doug Downey, Heng Ji, Tom Hope. The paper presents SCIMON, a new framework designed to push neural language models towards generating innovative scientific ideas that are informed by existing literature, going beyond simple binary link prediction. SCIMON generates natural language hypotheses by retrieving inspirations from previous papers and iteratively refining these ideas to enhance their novelty and ensure they are sufficiently distinct from prior research. Evaluations indicate that while models like GPT-4 tend to produce ideas lacking in novelty and technical depth, the SCIMON framework is capable of overcoming some of these limitations to inspire more original scientific thinking.
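A schematic of the retrieve-then-refine loop the summary describes, with `generate`, `retrieve_similar`, and `similarity` as placeholder callables for an LLM, a literature retriever, and a text-similarity measure; it mirrors the framework's iterative novelty boosting only at a very high level.

```python
from typing import Callable, List

def refine_for_novelty(problem: str,
                       generate: Callable[[str], str],
                       retrieve_similar: Callable[[str], List[str]],
                       similarity: Callable[[str, str], float],
                       max_rounds: int = 3,
                       too_similar: float = 0.8) -> str:
    """Generate an idea, compare it against retrieved prior work, and ask the model
    to revise whenever the idea is still too close to an existing paper."""
    idea = generate(f"Propose a novel research idea for: {problem}")
    for _ in range(max_rounds):
        neighbors = retrieve_similar(idea)
        closest = max((similarity(idea, n) for n in neighbors), default=0.0)
        if closest < too_similar:
            break  # sufficiently distinct from prior work
        idea = generate(f"Revise this idea to differ more from prior work: {idea}")
    return idea
```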
arxiv preprint - SciMON: Scientific Inspiration Machines Optimized for Novelty (podbean.com)
-
I am thrilled to share that our paper "Look Hear: Gaze Prediction for Speech-directed Human Attention" has been accepted for #ECCV2024! Huge congrats to my co-authors Seo-Young Ahn, Zhibo Yang, Niranjan Balasubramanian, Dimitris Samaras, Gregory Zelinsky, and Minh Hoai Nguyen! In this work, we study how humans seamlessly integrate vision and language to direct attention towards specific goals during an incremental object referral task. To predict human gaze fixations in this naturalistic multimodal search scenario, we introduce the Attention in Referral Transformer (ART), a multimodal transformer architecture that integrates vision and language modalities to generate sequences of human gaze fixations. We have also collected a high-quality, large-scale dataset, which we name RefCOCO-Gaze, containing gaze fixations of humans performing our incremental object referral task. For more details on RefCOCO-Gaze, visit https://lnkd.in/g-EmZeav. Pre-print: https://lnkd.in/gZx73Jcw Stay tuned for the code!
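As a very rough sketch of what autoregressive fixation generation can look like, the loop below feeds previously predicted fixations back into a predictor; `next_fixation` stands in for a trained multimodal transformer head and is not the actual ART model.

```python
from typing import Callable, List, Tuple

Fixation = Tuple[float, float]  # normalized (x, y) image coordinates

def rollout_fixations(image_feats, text_tokens,
                      next_fixation: Callable[[object, object, List[Fixation]], Fixation],
                      max_fixations: int = 6) -> List[Fixation]:
    """Generate a scanpath one fixation at a time, feeding previous fixations back in."""
    scanpath: List[Fixation] = []
    for _ in range(max_fixations):
        scanpath.append(next_fixation(image_feats, text_tokens, scanpath))
    return scanpath

# Dummy predictor: start near a corner, then jump to the image center.
print(rollout_fixations(None, None,
                        lambda img, txt, hist: (0.5, 0.5) if hist else (0.1, 0.1),
                        max_fixations=3))
```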