In this episode, we discuss LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding by Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J. Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny, Vikas Chandra. LongVU presents a spatiotemporal adaptive compression method for processing long videos with multimodal large language models, efficiently reducing redundancy while preserving important visual information. It employs techniques such as cross-modal queries, DINOv2 features, and token reduction to manage spatial and temporal information. This approach achieves superior performance on video understanding benchmarks, handles lengthy videos well, and remains effective when scaled down to smaller models.
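For intuition, here is a minimal sketch of the temporal-reduction step described above: drop frames whose features are nearly identical to the last kept frame. It assumes per-frame feature vectors (e.g., DINOv2 embeddings) are already computed, and the similarity threshold is illustrative rather than the paper's exact setting.

```python
import numpy as np

def reduce_redundant_frames(frame_feats: np.ndarray, sim_threshold: float = 0.9):
    """Keep a frame only if it is sufficiently dissimilar to the last kept frame.

    frame_feats: (num_frames, dim) array of per-frame features (e.g., DINOv2 embeddings).
    Returns the indices of frames that survive temporal reduction.
    """
    # L2-normalize so dot products are cosine similarities.
    feats = frame_feats / (np.linalg.norm(frame_feats, axis=1, keepdims=True) + 1e-8)
    kept = [0]  # always keep the first frame
    for t in range(1, len(feats)):
        if float(feats[t] @ feats[kept[-1]]) < sim_threshold:
            kept.append(t)
    return kept

# Example: 8 random "frames"; frame 1 is a near-duplicate of frame 0 and should be dropped.
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 32))
feats[1] = feats[0] + 0.01 * rng.normal(size=32)
print(reduce_redundant_frames(feats))
```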
Ramin Mehran’s Post
More Relevant Posts
-
In this episode, we discuss Evaluating Text-to-Visual Generation with Image-to-Text Generation by Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, Deva Ramanan. The paper introduces VQAScore, a novel metric for evaluating the alignment of generated images to text prompts, utilizing a visual-question-answering model to score the relevance of images to prompts based on a simple yes-or-no question. Unlike existing metrics, the proposed VQAScore effectively handles complex prompts, demonstrating superior performance across numerous benchmarks, even when compared to proprietary models like GPT-4V. Additionally, the paper presents GenAI-Bench, a challenging new benchmark consisting of compositional text prompts and human ratings, and provides open-source access to their data and models to facilitate further research in text-to-visual generation evaluations.
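As a hedged sketch of the yes/no scoring recipe in the summary: ask a VQA model whether the image shows the prompt, and use its probability of answering "Yes" as the alignment score. The `yes_probability` callable and the question template are placeholders for whatever VQA model and prompt you plug in, not an API from the paper.

```python
from typing import Callable

def vqa_score(image_path: str,
              prompt: str,
              yes_probability: Callable[[str, str], float]) -> float:
    """VQAScore-style alignment: the probability that a VQA model answers 'Yes'
    when asked whether the image shows the prompt.

    yes_probability(image_path, question) -> P("Yes") is assumed to be provided
    by an off-the-shelf visual-question-answering model.
    """
    question = f'Does this figure show "{prompt}"? Please answer yes or no.'
    return yes_probability(image_path, question)

# Usage with a dummy model that always returns 0.5:
print(vqa_score("cat.png", "a cat wearing a hat", lambda img, q: 0.5))
```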
arxiv preprint - Evaluating Text-to-Visual Generation with Image-to-Text Generation (podbean.com)
-
In this episode, we discuss Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution by Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin. The Qwen2-VL Series introduces Naive Dynamic Resolution for processing images of varying resolutions more efficiently and integrates Multimodal Rotary Position Embedding for improved fusion of positional information across modalities. It employs a unified approach to processing both images and videos, enhancing visual perception, and explores scaling laws for large vision-language models by increasing model size and training data. The Qwen2-VL-72B model achieves competitive performance, rivaling top models like GPT-4o and Claude 3.5 Sonnet, and surpasses other generalist models across various benchmarks.
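A rough sketch of the dynamic-resolution bookkeeping mentioned above: the number of visual tokens grows with the input resolution instead of being fixed. The 14-pixel patch size and 2x2 token merge below are commonly cited Qwen2-VL settings, but treat the exact numbers as assumptions for illustration.

```python
import math

def visual_token_count(height: int, width: int,
                       patch: int = 14, merge: int = 2) -> int:
    """Estimate how many visual tokens a dynamic-resolution ViT front end produces:
    one token per patch, then merged in merge x merge groups."""
    patches_h = math.ceil(height / patch)
    patches_w = math.ceil(width / patch)
    return (patches_h * patches_w) // (merge * merge)

# Larger images naturally cost more tokens instead of being resized to a fixed grid.
for h, w in [(224, 224), (448, 448), (1024, 768)]:
    print((h, w), "->", visual_token_count(h, w), "tokens")
```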
Arxiv Paper - Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution (podbean.com)
-
In this episode, we discuss NVLM: Open Frontier-Class Multimodal LLMs by Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping. The paper introduces NVLM 1.0, a set of advanced multimodal large language models that achieve state-of-the-art performance on vision-language tasks and improve upon their text-only capabilities. It outlines the benefits of a novel architecture that enhances training efficiency and reasoning abilities using a 1-D tile-tagging design, emphasizing the importance of dataset quality and task diversity over scale. NVLM 1.0's models excel in multimodal and text-only tasks through the integration of high-quality data, and the model weights are released with plans to open-source the training code.
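To make the 1-D tile-tagging idea concrete, here is a small sketch in which each image tile's visual tokens are prefixed with a text tag so the language model knows which tile it is reading; the tag strings are hypothetical, not NVLM's exact tokens.

```python
from typing import List

def tag_tiles(tile_tokens: List[List[str]]) -> List[str]:
    """Flatten per-tile token lists into one sequence, inserting a 1-D text tag
    (<tile_global>, <tile_1>, <tile_2>, ...) before each tile's visual tokens."""
    sequence: List[str] = []
    for i, tokens in enumerate(tile_tokens):
        tag = "<tile_global>" if i == 0 else f"<tile_{i}>"
        sequence.append(tag)
        sequence.extend(tokens)
    return sequence

# Example with a global thumbnail tile plus two high-resolution tiles.
print(tag_tiles([["v0a", "v0b"], ["v1a", "v1b"], ["v2a", "v2b"]]))
```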
Arxiv Paper - NVLM: Open Frontier-Class Multimodal LLMs (podbean.com)
-
Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models Li et al.: https://lnkd.in/gTq7iSt4 #ArtificialIntelligence #DeepLearning #MachineLearning
-
Shedding new light on intrinsic image decomposition! MLI-NeRF leverages multi-light info in Neural Radiance Fields to enhance reflectance and shading separation without ground truth. Check out its performance on diverse scenes! 💡📷 #NeRF #MachineLearning #ComputerVision
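For background, the intrinsic-image model behind reflectance/shading separation is I = R ⊙ S per pixel, and under multiple lights the reflectance stays fixed while the shading changes. The toy snippet below only illustrates that multi-light consistency, not MLI-NeRF itself.

```python
import numpy as np

# Toy intrinsic-image model: image = reflectance * shading (per pixel, per channel).
rng = np.random.default_rng(1)
reflectance = rng.uniform(0.2, 1.0, size=(4, 4, 3))                   # light-invariant
shadings = [rng.uniform(0.1, 1.0, size=(4, 4, 1)) for _ in range(3)]  # one per light
images = [reflectance * s for s in shadings]

# Reflectance recovered under any light should agree, since only the shading changed.
recovered = [img / s for img, s in zip(images, shadings)]
print(np.allclose(recovered[0], recovered[1]), np.allclose(recovered[1], recovered[2]))
```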
-
Technology 🆚 Technology In light of the recent controversy involving OpenAI and Scarlett Johansson, where artists are increasingly concerned about their work being stolen or misused by AI technologies, the paper mentioned below takes a first step toward an effective solution. (paper link in the comment below) For artists worried about the unauthorised use and reproduction of their work, this paper provides a robust tool for protecting digital content. It's a fascinating read that combines theoretical insights with practical applications, making it highly relevant in today's digital age. Also, congratulations Mayank Kumar Singh on your recent publication. Do you think such technology can save artists from the technology stealing their work? 🤔 #OpenAIvScarlettJohansson #ScarlettJohansson #openai #privacy #watermarking #interspeech #artist OpenAI Sony
Our paper titled 🤫 SilentCipher: Deep Audio Watermarking (co-authored with Naoya Takahashi, Wei-Hsiang Liao and Yuki Mitsufuji, PhD) has been accepted at INTERSPEECH 2024. Please find the link to the open-sourced code and arXiv paper in the comments section! Summary: In this paper, we address the artefacts introduced by deep-learning-based watermarking methods and introduce a way to remove the need for perceptual losses, which leads to stable training and allows us to achieve SOTA in terms of both perceptual quality and robustness against distortion. Unlike previous methods, which work at a 16 kHz sampling rate, we also showcase our results at 44.1 kHz, opening the path for practical applications. SonyAI #interspeech2024 #sony #sonyai #audiowatermarking #watermark
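SilentCipher itself is a learned encoder/decoder, but for context here is a tiny classical spread-spectrum-style watermark in NumPy: a key-seeded pseudo-random carrier is added at low amplitude and detected by correlation. This is a generic baseline sketch, not the paper's method.

```python
import numpy as np

def embed_bit(audio: np.ndarray, bit: int, key: int, alpha: float = 0.005) -> np.ndarray:
    """Add a key-seeded pseudo-random carrier; its sign encodes one bit."""
    carrier = np.random.default_rng(key).standard_normal(audio.shape)
    return audio + alpha * (1 if bit else -1) * carrier

def detect_bit(audio: np.ndarray, key: int) -> int:
    """Correlate against the same carrier; positive correlation means bit = 1."""
    carrier = np.random.default_rng(key).standard_normal(audio.shape)
    return int(np.dot(audio, carrier) > 0)

sr = 44100
t = np.arange(sr) / sr
clean = 0.3 * np.sin(2 * np.pi * 440.0 * t)                            # 1 s of a 440 Hz tone
marked = embed_bit(clean, bit=1, key=42)
noisy = marked + 0.01 * np.random.default_rng(7).standard_normal(sr)   # mild distortion
print(detect_bit(noisy, key=42))                                       # expected: 1
```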
-
Object detection is a technique used to identify and locate objects within an image or video. YOLO 🏏🏏 (You Only Look Once) is a real-time object detection system that divides the image into a grid and predicts bounding boxes and class probabilities directly from full images in a single evaluation. 🚀🚀 I learned this from the AIMER Society - Artificial Intelligence Medical and Engineering Researchers Society and Sai Satish. #internship #objectdetection
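A minimal sketch of the grid-based decoding that single-shot detectors like YOLO use: each cell predicts box offsets, size, and a confidence, which are converted to image-space boxes in one pass. The tensor layout and threshold here are illustrative, not a specific YOLO version.

```python
import numpy as np

def decode_grid(preds: np.ndarray, conf_thresh: float = 0.5):
    """preds: (S, S, 5) = (x_off, y_off, w, h, confidence), all in [0, 1].
    Returns (x_center, y_center, w, h, conf) boxes in normalized image coordinates."""
    S = preds.shape[0]
    boxes = []
    for row in range(S):
        for col in range(S):
            x_off, y_off, w, h, conf = preds[row, col]
            if conf < conf_thresh:
                continue
            x_center = (col + x_off) / S   # cell-relative offset -> image coordinates
            y_center = (row + y_off) / S
            boxes.append((float(x_center), float(y_center), float(w), float(h), float(conf)))
    return boxes

# One confident detection near the middle of a 7x7 grid.
preds = np.zeros((7, 7, 5))
preds[3, 3] = [0.5, 0.5, 0.2, 0.3, 0.9]
print(decode_grid(preds))  # -> [(0.5, 0.5, 0.2, 0.3, 0.9)]
```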
-
In this episode, we discuss SciMON: Scientific Inspiration Machines Optimized for Novelty by Qingyun Wang, Doug Downey, Heng Ji, Tom Hope. The paper presents SCIMON, a new framework designed to push neural language models towards generating innovative scientific ideas that are informed by existing literature, going beyond simple binary link prediction. SCIMON generates natural language hypotheses by retrieving inspirations from previous papers and iteratively refining these ideas to enhance their novelty and ensure they are sufficiently distinct from prior research. Evaluations indicate that while models like GPT-4 tend to produce ideas lacking in novelty and technical depth, the SCIMON framework is capable of overcoming some of these limitations to inspire more original scientific thinking.
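A schematic of the retrieve-then-refine loop the summary describes, with `generate`, `retrieve_similar`, and `similarity` as placeholder callables for an LLM, a literature retriever, and a text-similarity measure; it mirrors the framework's iterative novelty boosting only at a very high level.

```python
from typing import Callable, List

def refine_for_novelty(problem: str,
                       generate: Callable[[str], str],
                       retrieve_similar: Callable[[str], List[str]],
                       similarity: Callable[[str, str], float],
                       max_rounds: int = 3,
                       too_similar: float = 0.8) -> str:
    """Generate an idea, compare it against retrieved prior work, and ask the model
    to revise whenever the idea is still too close to an existing paper."""
    idea = generate(f"Propose a novel research idea for: {problem}")
    for _ in range(max_rounds):
        neighbors = retrieve_similar(idea)
        closest = max((similarity(idea, n) for n in neighbors), default=0.0)
        if closest < too_similar:
            break  # sufficiently distinct from prior work
        idea = generate(f"Revise this idea to differ more from prior work: {idea}")
    return idea
```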
arxiv preprint - SciMON: Scientific Inspiration Machines Optimized for Novelty (podbean.com)
-
I am thrilled to share that our paper "Look Hear: Gaze Prediction for Speech-directed Human Attention" has been accepted for #ECCV2024! Huge congrats to my co-authors Seo-Young Ahn, Zhibo Yang, Niranjan Balasubramanian, Dimitris Samaras, Gregory Zelinsky, and Minh Hoai Nguyen! In this work, we study how humans seamlessly integrate vision and language to direct attention towards specific goals during an incremental object referral task. To predict human gaze fixations in this naturalistic multimodal search scenario, we introduce the Attention in Referral Transformer (ART), a multimodal transformer architecture that integrates vision and language modalities to generate sequences of human gaze fixations. We have also collected a high-quality, large-scale dataset, which we name RefCOCO-Gaze, containing gaze fixations of humans performing our incremental object referral task. For more details on RefCOCO-Gaze, visit https://lnkd.in/g-EmZeav. Pre-print: https://lnkd.in/gZx73Jcw Stay tuned for the code!
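As a very rough sketch of what autoregressive fixation generation can look like, the loop below feeds previously predicted fixations back into a predictor; `next_fixation` stands in for a trained multimodal transformer head and is not the actual ART model.

```python
from typing import Callable, List, Tuple

Fixation = Tuple[float, float]  # normalized (x, y) image coordinates

def rollout_fixations(image_feats, text_tokens,
                      next_fixation: Callable[[object, object, List[Fixation]], Fixation],
                      max_fixations: int = 6) -> List[Fixation]:
    """Generate a scanpath one fixation at a time, feeding previous fixations back in."""
    scanpath: List[Fixation] = []
    for _ in range(max_fixations):
        scanpath.append(next_fixation(image_feats, text_tokens, scanpath))
    return scanpath

# Dummy predictor: start near a corner, then jump to the image center.
print(rollout_fixations(None, None,
                        lambda img, txt, hist: (0.5, 0.5) if hist else (0.1, 0.1),
                        max_fixations=3))
```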