Stoked to compare + contrast Liquid NNs with RNNs + Transformers to see which exerts more influence on performance: input dimensionality or network depth. I'm going to prototype two concurrent SLMs using LT-CNN architectures (one prioritizing a higher input node count, the other leveraging increased hidden-layer depth) to empirically assess the impact.
Eros Marcello’s Post
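A minimal sketch of how the two prototypes could be set up for a controlled width-vs-depth comparison. A GRU is used purely as a stand-in for the liquid/LT-CNN cell, and the vocabulary size, widths, and depths are arbitrary assumptions:

```python
import torch
import torch.nn as nn

class TinyRecurrentLM(nn.Module):
    """Toy SLM skeleton: embedding -> stacked recurrent core -> vocab head.
    The GRU is only a placeholder for the liquid/LT-CNN cell."""
    def __init__(self, vocab_size: int, width: int, depth: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, width)
        self.core = nn.GRU(width, width, num_layers=depth, batch_first=True)
        self.head = nn.Linear(width, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h, _ = self.core(self.embed(tokens))
        return self.head(h)

def count_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

vocab = 8000
wide_shallow = TinyRecurrentLM(vocab, width=512, depth=2)   # prioritize input width
narrow_deep  = TinyRecurrentLM(vocab, width=256, depth=8)   # prioritize depth

print("wide/shallow params:", count_params(wide_shallow))
print("narrow/deep params :", count_params(narrow_deep))

# Training both on the same data with the same loop isolates width vs. depth
# as the experimental variable.
x = torch.randint(0, vocab, (4, 32))                        # dummy batch of token ids
print(wide_shallow(x).shape, narrow_deep(x).shape)
```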
More Relevant Posts
-
Why only break down the encoder? The decoder will feel left out. So let's now do a breakdown of how decoder-only architectures work. To me, decoder-only architectures made no sense when I first heard of them. How do they predict the very FIRST word when all they receive is the SOS token? How do they produce varied first words if they have no basis for predicting the first word? This wasn't an issue for the original Transformer or BERT, because both use some form of context to predict the token: BERT uses bidirectional context, and the Transformer decoder attends to the encoder's keys and values (K_enc, V_enc) as a reference when predicting the first word. But a decoder-only model has absolutely no reference when the input is simply the SOS token. The explanation for this, along with everything else, is given in this article. https://lnkd.in/gRXVvYC5
Decoding the Decoder King: GPT-2
medium.com
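As a quick illustration of the "first word" question above, here is a minimal sketch (assuming the Hugging Face transformers package and the public "gpt2" checkpoint) that feeds GPT-2 nothing but its BOS token and inspects the next-token distribution it produces:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Assumes the public "gpt2" checkpoint; its BOS token is <|endoftext|>.
tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

bos = torch.tensor([[tok.bos_token_id]])          # input is a single BOS token
with torch.no_grad():
    logits = model(bos).logits[0, -1]             # scores for the very first word

probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tok.decode(int(idx))!r:>14}  p={p.item():.3f}")

# Sampling from `probs` (rather than taking the argmax) is what yields
# different "first words" across generations, even though the input is
# always just the BOS token.
```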
-
If you really want to understand the self-attention mechanism from the "Attention Is All You Need" paper and grasp the whole matrix-multiplication process in transformer-decoder architectures, this visual explanation is the best I've found so far. Highly recommended YouTube channel! https://lnkd.in/d8-Jk8s6
Attention in transformers, visually explained | DL6
https://www.youtube.com/
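For readers who prefer code to animation, a minimal sketch of the same matrix view: single-head causal self-attention with arbitrary sizes (not the video's code):

```python
import torch

def causal_self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model); w_*: (d_model, d_head)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_head = q.shape[-1]
    scores = q @ k.T / d_head ** 0.5                  # (seq_len, seq_len)
    mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))  # decoder: no looking ahead
    weights = torch.softmax(scores, dim=-1)           # rows sum to 1
    return weights @ v                                # (seq_len, d_head)

seq_len, d_model, d_head = 5, 16, 8
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
print(causal_self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([5, 8])
```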
-
Our latest blog explores the basics of Transformers, their unique mechanisms like self-attention and multi-head attention, and their transformative impact on the field. Check it out to dive into the future of AI! 🌐 Credits - Kumar Kanishk #thinknyx #MachineLearning #NLP #ArtificialIntelligence #Transformers #DeepLearning #BERT #GPT #TechInnovation #AIResearch #DataScience https://lnkd.in/gEnFRghC
Understanding Transformer Architectures | Thinknyx
https://www.thinknyx.com
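To complement the blog, a minimal sketch of multi-head attention: the model dimension is split across heads, each head runs scaled dot-product attention independently, and the results are concatenated and projected. Sizes are arbitrary; this is not the Thinknyx code:

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                              # x: (batch, seq, d_model)
        b, s, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape each to (batch, heads, seq, d_head)
        split = lambda t: t.view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        ctx = torch.softmax(scores, dim=-1) @ v        # (batch, heads, seq, d_head)
        ctx = ctx.transpose(1, 2).reshape(b, s, -1)    # concatenate the heads
        return self.out(ctx)

x = torch.randn(2, 10, 64)
print(MultiHeadSelfAttention(d_model=64, n_heads=8)(x).shape)  # (2, 10, 64)
```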
-
For inference, speed matters a lot. This paper from ShanghaiTech University proposes computing and caching the key-value (KV) pairs for the attention mechanism in only a small number of transformer layers, significantly reducing memory consumption and improving inference throughput. If generally applicable, this could have a substantial impact on local LLM feasibility. Extra bonus points for also having their GitHub up: https://lnkd.in/edKhW3HD https://lnkd.in/erazkf76
2405.10637
arxiv.org
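A rough back-of-the-envelope sketch of why caching KV for only a few layers helps: the cache size scales linearly with the number of layers that keep one. The model dimensions below are illustrative (roughly 7B-class), not taken from the paper:

```python
def kv_cache_bytes(layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # 2 tensors (K and V) per cached layer, fp16 = 2 bytes per element
    return 2 * layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

full = kv_cache_bytes(layers=32, n_kv_heads=32, head_dim=128, seq_len=4096, batch=8)
few  = kv_cache_bytes(layers=2,  n_kv_heads=32, head_dim=128, seq_len=4096, batch=8)

print(f"cache in all 32 layers: {full / 2**30:.1f} GiB")
print(f"cache in only 2 layers: {few / 2**30:.1f} GiB")
# The freed memory can hold larger batches, which is where the throughput
# gains come from.
```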
-
Kolmogorov–Arnold Networks seem like a paradigm shift in ML architectures. Curious to see how they evolve. "While MLPs have fixed activation functions on nodes ("neurons"), KANs have learnable activation functions on edges ("weights"). KANs have no linear weights at all -- every weight parameter is replaced by a univariate function parametrized as a spline." https://lnkd.in/gdPk_qCQ
KAN: Kolmogorov-Arnold Networks
arxiv.org
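A toy sketch of the quoted idea, with each edge carrying its own learnable univariate function. For brevity the per-edge function here is piecewise-linear on a fixed grid, a simplification of the B-spline parameterization used in the paper:

```python
import torch
import torch.nn as nn

class ToyKANLayer(nn.Module):
    """Each edge (out_i, in_j) has its own learnable 1-D function phi_{ij}(x_j);
    the layer output is sum_j phi_{ij}(x_j). Piecewise-linear, not true splines."""
    def __init__(self, in_dim, out_dim, grid_pts=16, x_range=3.0):
        super().__init__()
        self.x_min, self.x_max = -x_range, x_range
        self.register_buffer("grid", torch.linspace(-x_range, x_range, grid_pts))
        # one learnable value per grid point, per edge: (out_dim, in_dim, grid_pts)
        self.values = nn.Parameter(0.1 * torch.randn(out_dim, in_dim, grid_pts))

    def forward(self, x):                                   # x: (batch, in_dim)
        x = x.clamp(self.x_min, self.x_max)
        step = self.grid[1] - self.grid[0]
        idx = ((x - self.x_min) / step).floor().long().clamp(0, len(self.grid) - 2)
        frac = (x - self.grid[idx]) / step                  # position in the cell
        B, O = x.shape[0], self.values.shape[0]
        vals = self.values.unsqueeze(0).expand(B, -1, -1, -1)      # (B, O, I, G)
        idx_e = idx.unsqueeze(1).expand(-1, O, -1).unsqueeze(-1)   # (B, O, I, 1)
        left = vals.gather(-1, idx_e).squeeze(-1)
        right = vals.gather(-1, idx_e + 1).squeeze(-1)
        phi = left + frac.unsqueeze(1) * (right - left)     # per-edge phi(x_j)
        return phi.sum(dim=-1)                              # sum over inputs

layer = ToyKANLayer(in_dim=4, out_dim=3)
print(layer(torch.randn(8, 4)).shape)                       # torch.Size([8, 3])
```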
-
This quantisation technique will cut inference costs for both enterprises and consumers, and will put larger models within reach of more modest hardware! I'm a big fan of quantisation, and the described VPTQ is designed to allow 2-bit quantisation with minimal loss using lookup tables. This will allow you to run large models on undersized hardware. I have previously run 2-bit versions of Llama-3-70b, which took up about 23GB of VRAM. This technique would double tokens/s and greatly improve accuracy. I expect to see it integrated more widely in the near future. Catchy name too. https://lnkd.in/egzxCn_G
Papers with Code - VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models
paperswithcode.com
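A toy sketch of the lookup-table idea behind vector quantization in general (not the actual VPTQ algorithm): weights are split into short vectors, each vector is snapped to the nearest entry of a small codebook, and only the codebook plus per-vector indices are stored. With a 256-entry codebook over 4-element vectors, that works out to about 2 bits per weight:

```python
import torch

def vq_quantize(weights, codebook_size=256, vec_len=4, iters=10):
    vecs = weights.reshape(-1, vec_len)                      # (N, vec_len)
    # simplified k-means codebook fit (a few Lloyd iterations)
    codebook = vecs[torch.randperm(len(vecs))[:codebook_size]].clone()
    for _ in range(iters):
        assign = torch.cdist(vecs, codebook).argmin(dim=1)
        for k in range(codebook_size):
            mask = assign == k
            if mask.any():
                codebook[k] = vecs[mask].mean(dim=0)
    indices = torch.cdist(vecs, codebook).argmin(dim=1)      # this is what gets stored
    return codebook, indices

def vq_dequantize(codebook, indices, shape):
    return codebook[indices].reshape(shape)                  # pure table lookup

w = torch.randn(512, 512)
cb, idx = vq_quantize(w)
w_hat = vq_dequantize(cb, idx, w.shape)
bits_per_weight = idx.numel() * 8 / w.numel()                # 8-bit index per 4 weights
print(f"~{bits_per_weight:.0f} bits/weight, recon MSE {(w - w_hat).pow(2).mean():.4f}")
```

VPTQ itself adds much more (it optimizes the quantization to minimize loss), but the inference path is the same cheap lookup shown in vq_dequantize.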
-
I want to share a recent publication of work with Denis Osipov and Luigi Vanfretti that was done some time ago. It is a comprehensive comparison of several CNN architectures applied to PMU-like time-series data, answering the following questions: Which CNN architectures perform well for the fast small-signal stability assessment task? Which input data length is reasonable while preserving good performance? Which type of measurements is best for the considered setup? Does the nonlinear pattern of a contingency bring important information for training the CNN for this task? You are welcome to review: https://lnkd.in/eNy-Fape
Fast small signal stability assessment using deep convolutional neural networks
sciencedirect.com
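A minimal sketch of the kind of model compared in the study: a small 1-D CNN that classifies fixed-length windows of PMU-like measurements as stable/unstable. The channel count, window length, and two-class setup below are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class StabilityCNN(nn.Module):
    def __init__(self, n_signals: int, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_signals, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),            # tolerates any input window length
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):                       # x: (batch, n_signals, window_len)
        return self.classifier(self.features(x).squeeze(-1))

# e.g. 20 measured channels over a 1-second window sampled at 60 samples/s
batch = torch.randn(16, 20, 60)
print(StabilityCNN(n_signals=20)(batch).shape)  # torch.Size([16, 2])
```

Varying window_len and the set of input channels in this skeleton is how one would probe the data-length and measurement-type questions empirically.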
-
Dive into the world of deep learning with my exciting new article! Learn how PyTorch multiprocessing and data distribution can transform ensemble learning for model uncertainty quantification. Discover how parallel computing can reduce computation times and accelerate inference processes, making sophisticated uncertainty quantification techniques accessible for real-world applications. This exploration offers invaluable insights for AI practitioners looking to enhance model reliability and efficiency. Don't miss out on the synergy between advanced computational techniques and deep learning innovation. Check out the article now! #deeplearning #pytorch #artificialintelligence #UncertaintyEstimation #machinelearning #ensemblelearning #multiprocessing
Scaling Inference with Multi-GPU Architectures: A Deep Dive into Uncertainty Estimation
link.medium.com
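A minimal sketch of the pattern the article describes: each ensemble member runs in its own process (one per GPU where available), predictions are gathered, and their spread serves as the uncertainty estimate. The tiny MLP and random data are placeholders, not the article's models:

```python
import torch
import torch.multiprocessing as mp

def make_member(seed: int) -> torch.nn.Module:
    torch.manual_seed(seed)                       # different init per ensemble member
    return torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.ReLU(),
                               torch.nn.Linear(32, 1))

def worker(rank: int, inputs: torch.Tensor, queue):
    n_gpus = torch.cuda.device_count()
    device = f"cuda:{rank % n_gpus}" if n_gpus else "cpu"
    model = make_member(seed=rank).to(device).eval()
    with torch.no_grad():
        preds = model(inputs.to(device)).cpu()
    queue.put((rank, preds))

if __name__ == "__main__":
    n_members = 4
    x = torch.randn(256, 8)
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    procs = [ctx.Process(target=worker, args=(r, x, queue)) for r in range(n_members)]
    for p in procs:
        p.start()
    preds = torch.stack([queue.get()[1] for _ in procs])      # (members, N, 1)
    for p in procs:
        p.join()
    mean, std = preds.mean(dim=0), preds.std(dim=0)           # spread ~ uncertainty
    print(mean.shape, std.mean().item())
```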
-
If you haven't seen it yet, I really recommend having a look at this recent article: https://lnkd.in/d6fQKuyi, especially if you're interested in time-series forecasting/classification. The authors propose treating the transformer as an RNN by reformulating the attention equations; by introducing recurrence, they significantly reduce memory requirements to O(N) (O(N log(N)) for the prefix-scan algorithm) instead of quadratic complexity.
2405.13956
arxiv.org
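A minimal sketch of the recurrent view for a single query: the softmax-attention output over a stream of key/value pairs can be computed token by token with constant-size state (running max, normalizer, and weighted sum) instead of materializing the full score vector. The linked paper builds the full mechanism, including the prefix-scan variant mentioned above, on this kind of recurrence:

```python
import torch

def attention_as_recurrence(q, keys, values):
    m = torch.tensor(float("-inf"))      # running max of scores (numerical stability)
    s = torch.tensor(0.0)                # running sum of exp(score - m)
    z = torch.zeros_like(values[0])      # running exp-weighted sum of values
    for k, v in zip(keys, values):       # consume (k, v) pairs one at a time
        score = torch.dot(q, k) / q.shape[-1] ** 0.5
        m_new = torch.maximum(m, score)
        scale = torch.exp(m - m_new)     # rescale old state to the new max
        s = s * scale + torch.exp(score - m_new)
        z = z * scale + torch.exp(score - m_new) * v
        m = m_new
    return z / s                         # exactly softmax attention for this query

d = 16
q, keys, values = torch.randn(d), torch.randn(10, d), torch.randn(10, d)

# Reference: the usual vectorized softmax attention for the same query.
reference = torch.softmax(keys @ q / d ** 0.5, dim=0) @ values
print(torch.allclose(attention_as_recurrence(q, keys, values), reference, atol=1e-5))
```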
-
Key Takeaways from the Research:
Performance Gains: Chameleon reduced tail latency (P99 TTFT) by 80.7% and median latency (P50 TTFT) by 48.1%, significantly improving response times under heavy workloads.
Enhanced Throughput: The system achieved 1.5x higher throughput than baseline methods, allowing for more concurrent requests.
Dynamic Resource Management: Adaptive caching effectively utilized idle GPU memory, dynamically resizing the cache based on system demand to minimize adapter reloads.
Innovative Scheduling: The multi-level queue scheduler eliminated head-of-line blocking and ensured fair resource allocation, preventing starvation of larger requests (a toy sketch of the idea follows below).
Scalability: Chameleon efficiently supported adapter ranks from 8 to 128, demonstrating its suitability for diverse task complexities in multi-adapter settings.
Broader Implications: This research sets a precedent for designing inference systems that balance efficiency and scalability, addressing real-world production challenges in deploying large-scale LLMs.
Courtesy: Marktechpost
Chameleon: An AI System for Efficient Large Language Model Inference Using Adaptive Caching and Multi-Level Scheduling Techniques
https://www.marktechpost.com
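A toy sketch of the multi-level-queue idea credited above: short requests go to a high-priority queue so they are not stuck behind long ones (no head-of-line blocking), and waiting requests are periodically promoted so larger requests never starve. This illustrates the concept only and is not Chameleon's actual scheduler:

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    est_tokens: int          # rough cost estimate (e.g. prompt + max new tokens)
    waited: int = 0          # scheduling rounds spent waiting

class MultiLevelQueue:
    def __init__(self, thresholds=(256, 2048), promote_after=5):
        # queues[0] = shortest requests (highest priority), last = longest
        self.queues = [deque() for _ in range(len(thresholds) + 1)]
        self.thresholds = thresholds
        self.promote_after = promote_after

    def submit(self, req: Request):
        level = sum(req.est_tokens > t for t in self.thresholds)
        self.queues[level].append(req)

    def next_request(self):
        # age waiting requests; promote any that have waited too long
        for level in range(1, len(self.queues)):
            for _ in range(len(self.queues[level])):
                req = self.queues[level].popleft()
                req.waited += 1
                if req.waited >= self.promote_after:
                    req.waited = 0
                    self.queues[level - 1].append(req)   # promotion prevents starvation
                else:
                    self.queues[level].append(req)
        for q in self.queues:                            # serve highest priority first
            if q:
                return q.popleft()
        return None

sched = MultiLevelQueue()
for rid, cost in enumerate([4000, 128, 300, 64, 8000]):
    sched.submit(Request(rid, cost))
print([sched.next_request().rid for _ in range(5)])      # short requests served first
```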