Infini-attention enables Transformer LLMs to process infinitely long inputs with bounded memory and computation. It incorporates a compressive memory into the standard attention mechanism, combining masked local attention and long-term linear attention in a single block. The key, value, and query states of dot-product attention are reused for memory consolidation and retrieval: instead of being discarded, old KV states are stored in the compressive memory, and values retrieved from this long-term memory are aggregated with the local attention context through a learned gate.

The resulting Infini-Transformer operates on a stream of sequence segments, carrying context history across segments in the compressive memory while combining global and local context states. Each layer maintains a compressive memory per attention head, reusing the dot-product attention states for efficiency, and memory retrieval and updates are performed with an associative matrix, which improves both performance and training stability. Because the memory footprint stays constant regardless of input length, it outperforms baselines such as Transformer-XL and Memorizing Transformers on long-context language modeling while using significantly less memory, reaching better perplexity at 100K sequence length. A 1B LLM with Infini-attention scales to 1M-token sequences and solves the passkey retrieval task, and an 8B model achieves state-of-the-art results on a 500K-token book summarization task. The approach is practical: it changes standard attention only minimally, supports continual pre-training, and enables scaling to effectively unbounded context lengths with bounded memory.
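To make the memory path concrete, below is a minimal Python/PyTorch sketch of the per-segment retrieval, update, and gated aggregation described above, for a single head. It assumes an ELU+1 feature map, the simple linear (non-delta) memory update, and a scalar gate per head; all names are hypothetical and details may differ from the authors' implementation.

```python
# Illustrative sketch of Infini-attention's compressive memory path for one
# attention head and one segment. Names and exact details are assumptions.

import torch
import torch.nn.functional as F


def elu_plus_one(x: torch.Tensor) -> torch.Tensor:
    """Feature map sigma(x) = ELU(x) + 1 used for linear attention."""
    return F.elu(x) + 1.0


def infini_attention_segment(q, k, v, memory, z, beta, local_attn_out):
    """
    q, k, v        : [seg_len, d_key], [seg_len, d_key], [seg_len, d_value]
                     attention states of the current segment (per head).
    memory         : [d_key, d_value] associative matrix carried across segments.
    z              : [d_key] normalization term carried across segments.
    beta           : learned scalar gate mixing local and long-term context.
    local_attn_out : [seg_len, d_value] output of standard masked dot-product
                     attention over the current segment.
    Returns the gated output plus the updated (memory, z).
    """
    sigma_q = elu_plus_one(q)
    sigma_k = elu_plus_one(k)

    # 1) Retrieve values from the compressive memory built from past segments.
    retrieved = (sigma_q @ memory) / (sigma_q @ z).clamp_min(1e-6).unsqueeze(-1)

    # 2) Update the memory with the current segment's KV states (linear update).
    new_memory = memory + sigma_k.transpose(0, 1) @ v
    new_z = z + sigma_k.sum(dim=0)

    # 3) Aggregate long-term retrieval with the local attention context
    #    via a learned sigmoid gate.
    gate = torch.sigmoid(beta)
    output = gate * retrieved + (1.0 - gate) * local_attn_out
    return output, new_memory, new_z
```

In this sketch, memory and z start at zero and are simply carried from one segment to the next, which is what keeps the footprint constant no matter how long the full input grows.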
https://lnkd.in/gNGUqTNE