Infini-attention enables Transformer LLMs to process infinitely long inputs with bounded memory and computation. It incorporates a compressive memory into the standard attention mechanism, combining masked local attention and long-term linear attention in a single block. The key, value, and query states of dot-product attention are reused for memory consolidation and retrieval: instead of being discarded, old KV states are stored in the compressive memory, and values retrieved from this long-term memory are aggregated with the local attention context through a learned gate.

The resulting Infini-Transformer operates on a stream of sequence segments, carrying context history across segments in the compressive memory while combining global and local context states. Each layer maintains a compressive memory per attention head, reusing the dot-product attention states for efficiency, and memory retrieval and updates are performed with an associative matrix, which improves both performance and training stability. Because the memory footprint stays constant regardless of input length, it outperforms baselines such as Transformer-XL and Memorizing Transformers on long-context language modeling while using significantly less memory, reaching better perplexity at 100K sequence length. A 1B LLM with Infini-attention scales to 1M-token sequences and solves the passkey retrieval task, and an 8B model achieves state-of-the-art results on a 500K-token book summarization task. The approach is practical: it changes standard attention only minimally, supports continual pre-training, and enables scaling to effectively unbounded context lengths with bounded memory.
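To make the memory path concrete, below is a minimal Python/PyTorch sketch of the per-segment retrieval, update, and gated aggregation described above, for a single head. It assumes an ELU+1 feature map, the simple linear (non-delta) memory update, and a scalar gate per head; all names are hypothetical and details may differ from the authors' implementation.

```python
# Illustrative sketch of Infini-attention's compressive memory path for one
# attention head and one segment. Names and exact details are assumptions.

import torch
import torch.nn.functional as F


def elu_plus_one(x: torch.Tensor) -> torch.Tensor:
    """Feature map sigma(x) = ELU(x) + 1 used for linear attention."""
    return F.elu(x) + 1.0


def infini_attention_segment(q, k, v, memory, z, beta, local_attn_out):
    """
    q, k, v        : [seg_len, d_key], [seg_len, d_key], [seg_len, d_value]
                     attention states of the current segment (per head).
    memory         : [d_key, d_value] associative matrix carried across segments.
    z              : [d_key] normalization term carried across segments.
    beta           : learned scalar gate mixing local and long-term context.
    local_attn_out : [seg_len, d_value] output of standard masked dot-product
                     attention over the current segment.
    Returns the gated output plus the updated (memory, z).
    """
    sigma_q = elu_plus_one(q)
    sigma_k = elu_plus_one(k)

    # 1) Retrieve values from the compressive memory built from past segments.
    retrieved = (sigma_q @ memory) / (sigma_q @ z).clamp_min(1e-6).unsqueeze(-1)

    # 2) Update the memory with the current segment's KV states (linear update).
    new_memory = memory + sigma_k.transpose(0, 1) @ v
    new_z = z + sigma_k.sum(dim=0)

    # 3) Aggregate long-term retrieval with the local attention context
    #    via a learned sigmoid gate.
    gate = torch.sigmoid(beta)
    output = gate * retrieved + (1.0 - gate) * local_attn_out
    return output, new_memory, new_z
```

In this sketch, memory and z start at zero and are simply carried from one segment to the next, which is what keeps the footprint constant no matter how long the full input grows.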
https://lnkd.in/gNGUqTNE