Stoked to compare + contrast Liquid NNs with RNNs + Transformers to see which exerts more influence on performance: input dimensionality or network depth. I'm going to prototype two concurrent SLMs using LT-CNN architectures (one prioritizing a higher input node count, the other leveraging increased hidden-layer depth) to empirically assess the impact.
Eros Marcello’s Post
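A minimal sketch of how the two prototypes could be set up for a controlled width-vs-depth comparison. A GRU is used purely as a stand-in for the liquid/LT-CNN cell, and the vocabulary size, widths, and depths are arbitrary assumptions:

```python
import torch
import torch.nn as nn

class TinyRecurrentLM(nn.Module):
    """Toy SLM skeleton: embedding -> stacked recurrent core -> vocab head.
    The GRU is only a placeholder for the liquid/LT-CNN cell."""
    def __init__(self, vocab_size: int, width: int, depth: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, width)
        self.core = nn.GRU(width, width, num_layers=depth, batch_first=True)
        self.head = nn.Linear(width, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h, _ = self.core(self.embed(tokens))
        return self.head(h)

def count_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

vocab = 8000
wide_shallow = TinyRecurrentLM(vocab, width=512, depth=2)   # prioritize input width
narrow_deep  = TinyRecurrentLM(vocab, width=256, depth=8)   # prioritize depth

print("wide/shallow params:", count_params(wide_shallow))
print("narrow/deep params :", count_params(narrow_deep))

# Training both on the same data with the same loop isolates width vs. depth
# as the experimental variable.
x = torch.randint(0, vocab, (4, 32))                        # dummy batch of token ids
print(wide_shallow(x).shape, narrow_deep(x).shape)
```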
More Relevant Posts
-
Why only break down the encoder? The decoder will feel left out. So let's now do a breakdown of how decoder-only architectures work. To me, decoder-only architectures made no sense when I first heard of them. How do they predict the very FIRST word when all they receive is the SOS token? How do they produce varied first words if they have no basis for predicting the first word? This wasn't an issue for the original Transformer or BERT, because both use some form of context to predict the token: BERT uses bidirectional context, and the Transformer decoder attends to the encoder's keys and values (K_enc, V_enc) as a reference when predicting the first word. But a decoder-only model has absolutely no reference when the input is simply the SOS token. The explanation for this, along with everything else, is given in this article. https://lnkd.in/gRXVvYC5
Decoding the Decoder King: GPT-2
medium.com
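As a quick illustration of the "first word" question above, here is a minimal sketch (assuming the Hugging Face transformers package and the public "gpt2" checkpoint) that feeds GPT-2 nothing but its BOS token and inspects the next-token distribution it produces:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Assumes the public "gpt2" checkpoint; its BOS token is <|endoftext|>.
tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

bos = torch.tensor([[tok.bos_token_id]])          # input is a single BOS token
with torch.no_grad():
    logits = model(bos).logits[0, -1]             # scores for the very first word

probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tok.decode(int(idx))!r:>14}  p={p.item():.3f}")

# Sampling from `probs` (rather than taking the argmax) is what yields
# different "first words" across generations, even though the input is
# always just the BOS token.
```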
-
If you really want to understand the self-attention mechanism from the "Attention Is All You Need" paper and grasp the whole matrix-multiplication process in transformer-decoder architectures, this visual explanation is the best I've found so far. Highly recommended YouTube channel! https://lnkd.in/d8-Jk8s6
Attention in transformers, visually explained | DL6
https://www.youtube.com/
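For readers who prefer code to animation, a minimal sketch of the same matrix view: single-head causal self-attention with arbitrary sizes (not the video's code):

```python
import torch

def causal_self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model); w_*: (d_model, d_head)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_head = q.shape[-1]
    scores = q @ k.T / d_head ** 0.5                  # (seq_len, seq_len)
    mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))  # decoder: no looking ahead
    weights = torch.softmax(scores, dim=-1)           # rows sum to 1
    return weights @ v                                # (seq_len, d_head)

seq_len, d_model, d_head = 5, 16, 8
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
print(causal_self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([5, 8])
```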
-
Our latest blog explores the basics of Transformers, their unique mechanisms like self-attention and multi-head attention, and their transformative impact on the field. Check it out to dive into the future of AI! 🌐 Credits - Kumar Kanishk #thinknyx #MachineLearning #NLP #ArtificialIntelligence #Transformers #DeepLearning #BERT #GPT #TechInnovation #AIResearch #DataScience https://lnkd.in/gEnFRghC
Understanding Transformer Architectures | Thinknyx
https://www.thinknyx.com
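To complement the blog, a minimal sketch of multi-head attention: the model dimension is split across heads, each head runs scaled dot-product attention independently, and the results are concatenated and projected. Sizes are arbitrary; this is not the Thinknyx code:

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                              # x: (batch, seq, d_model)
        b, s, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape each to (batch, heads, seq, d_head)
        split = lambda t: t.view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        ctx = torch.softmax(scores, dim=-1) @ v        # (batch, heads, seq, d_head)
        ctx = ctx.transpose(1, 2).reshape(b, s, -1)    # concatenate the heads
        return self.out(ctx)

x = torch.randn(2, 10, 64)
print(MultiHeadSelfAttention(d_model=64, n_heads=8)(x).shape)  # (2, 10, 64)
```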
-
For inference, speed matters a lot. This paper from ShanghaiTech University proposes computing and caching the key-value (KV) pairs for the attention mechanism in only a small number of transformer layers, significantly reducing memory consumption and improving inference throughput. If generally applicable, this could have a substantial impact on local LLM feasibility. Extra bonus points for also having their GitHub up: https://lnkd.in/edKhW3HD https://lnkd.in/erazkf76
2405.10637
arxiv.org
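A rough back-of-the-envelope sketch of why caching KV for only a few layers helps: the cache size scales linearly with the number of layers that keep one. The model dimensions below are illustrative (roughly 7B-class), not taken from the paper:

```python
def kv_cache_bytes(layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # 2 tensors (K and V) per cached layer, fp16 = 2 bytes per element
    return 2 * layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

full = kv_cache_bytes(layers=32, n_kv_heads=32, head_dim=128, seq_len=4096, batch=8)
few  = kv_cache_bytes(layers=2,  n_kv_heads=32, head_dim=128, seq_len=4096, batch=8)

print(f"cache in all 32 layers: {full / 2**30:.1f} GiB")
print(f"cache in only 2 layers: {few / 2**30:.1f} GiB")
# The freed memory can hold larger batches, which is where the throughput
# gains come from.
```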
-
Kolmogorov–Arnold Networks seem like a paradigm shift in ML architectures. Curious to see how they evolve. "While MLPs have fixed activation functions on nodes ("neurons"), KANs have learnable activation functions on edges ("weights"). KANs have no linear weights at all -- every weight parameter is replaced by a univariate function parametrized as a spline." https://lnkd.in/gdPk_qCQ
KAN: Kolmogorov-Arnold Networks
arxiv.org
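A toy sketch of the quoted idea, with each edge carrying its own learnable univariate function. For brevity the per-edge function here is piecewise-linear on a fixed grid, a simplification of the B-spline parameterization used in the paper:

```python
import torch
import torch.nn as nn

class ToyKANLayer(nn.Module):
    """Each edge (out_i, in_j) has its own learnable 1-D function phi_{ij}(x_j);
    the layer output is sum_j phi_{ij}(x_j). Piecewise-linear, not true splines."""
    def __init__(self, in_dim, out_dim, grid_pts=16, x_range=3.0):
        super().__init__()
        self.x_min, self.x_max = -x_range, x_range
        self.register_buffer("grid", torch.linspace(-x_range, x_range, grid_pts))
        # one learnable value per grid point, per edge: (out_dim, in_dim, grid_pts)
        self.values = nn.Parameter(0.1 * torch.randn(out_dim, in_dim, grid_pts))

    def forward(self, x):                                   # x: (batch, in_dim)
        x = x.clamp(self.x_min, self.x_max)
        step = self.grid[1] - self.grid[0]
        idx = ((x - self.x_min) / step).floor().long().clamp(0, len(self.grid) - 2)
        frac = (x - self.grid[idx]) / step                  # position in the cell
        B, O = x.shape[0], self.values.shape[0]
        vals = self.values.unsqueeze(0).expand(B, -1, -1, -1)      # (B, O, I, G)
        idx_e = idx.unsqueeze(1).expand(-1, O, -1).unsqueeze(-1)   # (B, O, I, 1)
        left = vals.gather(-1, idx_e).squeeze(-1)
        right = vals.gather(-1, idx_e + 1).squeeze(-1)
        phi = left + frac.unsqueeze(1) * (right - left)     # per-edge phi(x_j)
        return phi.sum(dim=-1)                              # sum over inputs

layer = ToyKANLayer(in_dim=4, out_dim=3)
print(layer(torch.randn(8, 4)).shape)                       # torch.Size([8, 3])
```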
-
This quantisation technique will cut inference costs for both enterprises and consumers, and will put larger models within reach of more modest hardware! I'm a big fan of quantisation, and the described VPTQ is designed to allow 2-bit quantisation with minimal loss using lookup tables. This will allow you to run large models on undersized hardware. I have previously run 2-bit versions of Llama-3-70b, which took up about 23GB of VRAM. This technique would double tokens/s and greatly improve accuracy. I expect to see it integrated more widely in the near future. Catchy name too. https://lnkd.in/egzxCn_G
Papers with Code - VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models
paperswithcode.com
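A toy sketch of the lookup-table idea behind vector quantization in general (not the actual VPTQ algorithm): weights are split into short vectors, each vector is snapped to the nearest entry of a small codebook, and only the codebook plus per-vector indices are stored. With a 256-entry codebook over 4-element vectors, that works out to about 2 bits per weight:

```python
import torch

def vq_quantize(weights, codebook_size=256, vec_len=4, iters=10):
    vecs = weights.reshape(-1, vec_len)                      # (N, vec_len)
    # simplified k-means codebook fit (a few Lloyd iterations)
    codebook = vecs[torch.randperm(len(vecs))[:codebook_size]].clone()
    for _ in range(iters):
        assign = torch.cdist(vecs, codebook).argmin(dim=1)
        for k in range(codebook_size):
            mask = assign == k
            if mask.any():
                codebook[k] = vecs[mask].mean(dim=0)
    indices = torch.cdist(vecs, codebook).argmin(dim=1)      # this is what gets stored
    return codebook, indices

def vq_dequantize(codebook, indices, shape):
    return codebook[indices].reshape(shape)                  # pure table lookup

w = torch.randn(512, 512)
cb, idx = vq_quantize(w)
w_hat = vq_dequantize(cb, idx, w.shape)
bits_per_weight = idx.numel() * 8 / w.numel()                # 8-bit index per 4 weights
print(f"~{bits_per_weight:.0f} bits/weight, recon MSE {(w - w_hat).pow(2).mean():.4f}")
```

VPTQ itself adds much more (it optimizes the quantization to minimize loss), but the inference path is the same cheap lookup shown in vq_dequantize.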
-
I want to share a recent publication of work with Denis Osipov and Luigi Vanfretti that was done some time ago. It is a comprehensive comparison of several CNN architectures applied to PMU-like time-series data, answering the following questions: Which CNN architectures perform well for the fast small-signal stability assessment task? Which input data length is reasonable while preserving good performance? Which type of measurements is best for the considered setup? Does the nonlinear pattern of a contingency bring important information for training the CNN for this task? You are welcome to review: https://lnkd.in/eNy-Fape
Fast small signal stability assessment using deep convolutional neural networks
sciencedirect.com
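A minimal sketch of the kind of model compared in the study: a small 1-D CNN that classifies fixed-length windows of PMU-like measurements as stable/unstable. The channel count, window length, and two-class setup below are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class StabilityCNN(nn.Module):
    def __init__(self, n_signals: int, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_signals, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),            # tolerates any input window length
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):                       # x: (batch, n_signals, window_len)
        return self.classifier(self.features(x).squeeze(-1))

# e.g. 20 measured channels over a 1-second window sampled at 60 samples/s
batch = torch.randn(16, 20, 60)
print(StabilityCNN(n_signals=20)(batch).shape)  # torch.Size([16, 2])
```

Varying window_len and the set of input channels in this skeleton is how one would probe the data-length and measurement-type questions empirically.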
-
Dive into the world of deep learning with my exciting new article! Learn how PyTorch multiprocessing and data distribution can transform ensemble learning for model uncertainty quantification. Discover how parallel computing can reduce computation times and accelerate inference processes, making sophisticated uncertainty quantification techniques accessible for real-world applications. This exploration offers invaluable insights for AI practitioners looking to enhance model reliability and efficiency. Don't miss out on the synergy between advanced computational techniques and deep learning innovation. Check out the article now! #deeplearning #pytorch #artificialintelligence #UncertaintyEstimation #machinelearning #ensemblelearning #multiprocessing
Scaling Inference with Multi-GPU Architectures: A Deep Dive into Uncertainty Estimation
link.medium.com
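A minimal sketch of the pattern the article describes: each ensemble member runs in its own process (one per GPU where available), predictions are gathered, and their spread serves as the uncertainty estimate. The tiny MLP and random data are placeholders, not the article's models:

```python
import torch
import torch.multiprocessing as mp

def make_member(seed: int) -> torch.nn.Module:
    torch.manual_seed(seed)                       # different init per ensemble member
    return torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.ReLU(),
                               torch.nn.Linear(32, 1))

def worker(rank: int, inputs: torch.Tensor, queue):
    n_gpus = torch.cuda.device_count()
    device = f"cuda:{rank % n_gpus}" if n_gpus else "cpu"
    model = make_member(seed=rank).to(device).eval()
    with torch.no_grad():
        preds = model(inputs.to(device)).cpu()
    queue.put((rank, preds))

if __name__ == "__main__":
    n_members = 4
    x = torch.randn(256, 8)
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    procs = [ctx.Process(target=worker, args=(r, x, queue)) for r in range(n_members)]
    for p in procs:
        p.start()
    preds = torch.stack([queue.get()[1] for _ in procs])      # (members, N, 1)
    for p in procs:
        p.join()
    mean, std = preds.mean(dim=0), preds.std(dim=0)           # spread ~ uncertainty
    print(mean.shape, std.mean().item())
```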
-
If you haven't seen it yet, I really recommend having a look at this recent article: https://lnkd.in/d6fQKuyi, especially if you're interested in time-series forecasting/classification. The authors propose treating the transformer as an RNN by reformulating the attention equations; by introducing recurrence, they significantly reduce memory requirements to O(N) (O(N log(N)) for the prefix-scan algorithm) instead of quadratic complexity.
2405.13956
arxiv.org
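A minimal sketch of the recurrent view for a single query: the softmax-attention output over a stream of key/value pairs can be computed token by token with constant-size state (running max, normalizer, and weighted sum) instead of materializing the full score vector. The linked paper builds the full mechanism, including the prefix-scan variant mentioned above, on this kind of recurrence:

```python
import torch

def attention_as_recurrence(q, keys, values):
    m = torch.tensor(float("-inf"))      # running max of scores (numerical stability)
    s = torch.tensor(0.0)                # running sum of exp(score - m)
    z = torch.zeros_like(values[0])      # running exp-weighted sum of values
    for k, v in zip(keys, values):       # consume (k, v) pairs one at a time
        score = torch.dot(q, k) / q.shape[-1] ** 0.5
        m_new = torch.maximum(m, score)
        scale = torch.exp(m - m_new)     # rescale old state to the new max
        s = s * scale + torch.exp(score - m_new)
        z = z * scale + torch.exp(score - m_new) * v
        m = m_new
    return z / s                         # exactly softmax attention for this query

d = 16
q, keys, values = torch.randn(d), torch.randn(10, d), torch.randn(10, d)

# Reference: the usual vectorized softmax attention for the same query.
reference = torch.softmax(keys @ q / d ** 0.5, dim=0) @ values
print(torch.allclose(attention_as_recurrence(q, keys, values), reference, atol=1e-5))
```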
-
Key Takeaways from the Research:
Performance Gains: Chameleon reduced tail latency (P99 TTFT) by 80.7% and median latency (P50 TTFT) by 48.1%, significantly improving response times under heavy workloads.
Enhanced Throughput: The system achieved 1.5x higher throughput than baseline methods, allowing for more concurrent requests.
Dynamic Resource Management: Adaptive caching effectively utilized idle GPU memory, dynamically resizing the cache based on system demand to minimize adapter reloads.
Innovative Scheduling: The multi-level queue scheduler eliminated head-of-line blocking and ensured fair resource allocation, preventing starvation of larger requests (a toy sketch of the idea follows below).
Scalability: Chameleon efficiently supported adapter ranks from 8 to 128, demonstrating its suitability for diverse task complexities in multi-adapter settings.
Broader Implications: This research sets a precedent for designing inference systems that balance efficiency and scalability, addressing real-world production challenges in deploying large-scale LLMs.
Courtesy: Marktechpost
Chameleon: An AI System for Efficient Large Language Model Inference Using Adaptive Caching and Multi-Level Scheduling Techniques
https://www.marktechpost.com
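A toy sketch of the multi-level-queue idea credited above: short requests go to a high-priority queue so they are not stuck behind long ones (no head-of-line blocking), and waiting requests are periodically promoted so larger requests never starve. This illustrates the concept only and is not Chameleon's actual scheduler:

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    est_tokens: int          # rough cost estimate (e.g. prompt + max new tokens)
    waited: int = 0          # scheduling rounds spent waiting

class MultiLevelQueue:
    def __init__(self, thresholds=(256, 2048), promote_after=5):
        # queues[0] = shortest requests (highest priority), last = longest
        self.queues = [deque() for _ in range(len(thresholds) + 1)]
        self.thresholds = thresholds
        self.promote_after = promote_after

    def submit(self, req: Request):
        level = sum(req.est_tokens > t for t in self.thresholds)
        self.queues[level].append(req)

    def next_request(self):
        # age waiting requests; promote any that have waited too long
        for level in range(1, len(self.queues)):
            for _ in range(len(self.queues[level])):
                req = self.queues[level].popleft()
                req.waited += 1
                if req.waited >= self.promote_after:
                    req.waited = 0
                    self.queues[level - 1].append(req)   # promotion prevents starvation
                else:
                    self.queues[level].append(req)
        for q in self.queues:                            # serve highest priority first
            if q:
                return q.popleft()
        return None

sched = MultiLevelQueue()
for rid, cost in enumerate([4000, 128, 300, 64, 8000]):
    sched.submit(Request(rid, cost))
print([sched.next_request().rid for _ in range(5)])      # short requests served first
```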