Open LLMs are clearly gaining popularity, and Kubernetes remains the most scalable way to serve them. Two serving frameworks that support full Kubernetes deployment stand out: KServe (a CNCF project) and Ray Serve. In addition, NVIDIA Triton Inference Server serves open LLMs extremely well as the underlying inference engine. #knative #kserve #rayserve #kuberay #nvidiatriton I often get asked which one to use for Kubernetes inference, and the honest answer is that it depends. Rather than fixating on the framework, focus on custom load balancing and autoscaling of serving, and on gathering key performance metrics such as KV cache utilization and other resource-utilization signals. Collecting LLM performance data and acting on it is the most critical part of MLOps. #kubernetes #mlops #openllm
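To make the metrics-first point concrete, here is a minimal sketch of polling a serving endpoint's Prometheus-style /metrics output for KV cache utilization. The service address and the metric name (following vLLM's naming) are assumptions; substitute whatever gauge your serving stack actually publishes.

```python
# A minimal sketch of polling an LLM server's Prometheus-style /metrics endpoint
# and extracting a KV-cache utilization gauge to inform scaling decisions.
# The endpoint URL and metric name are assumptions, not a specific deployment.
import requests

METRICS_URL = "http://llm-service.default.svc:8000/metrics"  # assumed service address
KV_CACHE_METRIC = "vllm:gpu_cache_usage_perc"                # assumed metric name

def kv_cache_utilization() -> float:
    """Return the KV-cache usage fraction reported by the server."""
    text = requests.get(METRICS_URL, timeout=5).text
    for line in text.splitlines():
        if line.startswith(KV_CACHE_METRIC):
            # Prometheus exposition format: "<name>{labels} <value>"
            return float(line.rsplit(" ", 1)[-1])
    raise RuntimeError(f"metric {KV_CACHE_METRIC} not found")

if __name__ == "__main__":
    usage = kv_cache_utilization()
    # Example policy: flag scale-out when the cache is more than 80% full.
    print(f"KV cache usage: {usage:.2%}", "-> scale out" if usage > 0.8 else "-> ok")
```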
Roy Budhaditya’s Post
More Relevant Posts
-
Quora modernises model serving with NVIDIA Triton on Amazon EKS. This blog post explains how the leading Q&A platform modernised its model-serving architecture with NVIDIA Triton Inference Server on Amazon EKS, detailing the design decisions that reduced model-serving latency by 3x and model-serving cost by 25%.
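For readers who have not used Triton directly, here is a minimal client-side sketch using the official tritonclient package; the endpoint, model name, and tensor names are placeholders, not the ones from Quora's deployment.

```python
# A minimal sketch of calling a Triton Inference Server over HTTP with the
# official tritonclient package. Model name, tensor names, and shapes are
# placeholders; real deployments take them from the model's config.pbtxt.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")  # assumed endpoint

# Build the input tensor the server expects (placeholder name/shape/dtype).
data = np.random.rand(1, 4).astype(np.float32)
infer_input = httpclient.InferInput("INPUT__0", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

# Run inference against a placeholder model name and read back the output tensor.
result = client.infer(model_name="example_model", inputs=[infer_input])
print(result.as_numpy("OUTPUT__0"))
```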
-
A lot of customers I speak to are interested in GenAI for software development. They want to enable their #devs to spend their time coding and improve productivity. The challenge, however, is how to do this in a secure, controlled, and legal manner. One option is running an #LLM on-premises, such as #CodeLlama from #Meta, or a packaged solution like Codeium Enterprise from Codeium. Dell and Codeium have worked together to establish affordable hardware configurations for running Codeium Enterprise on-premises, entirely air-gapped, to keep your intellectual property and data secure. For a feel of what local inference looks like in practice, a minimal sketch follows the link below. #GenAI #AI #DevOps Learn more about this collaborative solution: https://lnkd.in/eMNBuA3D
Solution Brief–Codeium Enterprise on Dell Infrastructure
infohub.delltechnologies.com
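As a rough illustration of the on-prem option (not the Codeium/Dell solution itself), here is a minimal sketch of running a Code Llama checkpoint locally with Hugging Face Transformers. The checkpoint name is an assumption, and in a truly air-gapped setup the weights would be mirrored internally before use.

```python
# A minimal sketch of on-premises code completion with Hugging Face Transformers.
# The checkpoint name is an assumption; any locally mirrored Code Llama (or similar)
# weights work the same way once downloaded inside the air-gapped environment.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "codellama/CodeLlama-7b-hf"  # assumed checkpoint; pre-download for air-gapped use

tokenizer = AutoTokenizer.from_pretrained(MODEL)
# device_map="auto" spreads the model across available GPUs (requires accelerate).
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```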
-
Came across this fascinating blog from 2021 where OpenAI scaled their Kubernetes clusters to an incredible 7,500 nodes! It's amazing to see how they handled massive workloads for their LLM models, optimizing both hardware and network efficiency. This cluster supported large-scale LLM models for rapid research and required unique solutions like transitioning from Flannel to native pod networking for better throughput, leveraging NVIDIA GPUs, and using EndpointSlices to handle API server load.
Key Highlights: ⬇️
• Networking: Switched to native pod networking to handle around 200,000 IPs, ensuring high throughput.
• Resource Management: Tools like the "team-resource-manager" dynamically allocated nodes, while GPU health checks kept everything running smoothly.
• API Server Optimization: Using EndpointSlices reduced API load, maintaining cluster performance at scale (a small sketch of listing EndpointSlices follows the link below).
Matt Rickard, ex-Googler who worked on Kubernetes and was one of the Kubeflow maintainers, posted a few observations about why he thinks OpenAI prefers K8s over HPC frameworks and what makes this K8s use case special. This detailed journey is a goldmine for anyone working on large-scale Kubernetes deployments! 💻
👉 Full blog here https://lnkd.in/gSjFkjZy
#Kubernetes #CloudNative #Scalability #OpenAI #TechInsights
Scaling Kubernetes to 7,500 nodes
openai.com
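As a small hands-on companion to the EndpointSlices point (this is not OpenAI's tooling, just a generic sketch with an assumed namespace), here is how you could list a Service's EndpointSlices with the official Kubernetes Python client.

```python
# A minimal sketch of inspecting EndpointSlices with the official Kubernetes
# Python client, showing how a Service's endpoints are split into smaller
# slices instead of one giant Endpoints object.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside a pod
discovery = client.DiscoveryV1Api()

# "default" namespace is an assumption; point this at your own Service's namespace.
slices = discovery.list_namespaced_endpoint_slice(namespace="default")
for s in slices.items:
    owner = (s.metadata.labels or {}).get("kubernetes.io/service-name", "<unknown>")
    n_endpoints = len(s.endpoints or [])
    print(f"{s.metadata.name}: service={owner}, endpoints={n_endpoints}")
```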
-
As much fun as it is to use LLMs, I was curious to learn more about the hardware we use behind the scenes to create such innovative models. I got to take a moment to explore the infrastructure world while at #MSIgnite! 👩🏾💻 I met with Alistair Speirs, who gave me a tour of one of our AI servers, including a look at a GPU! #MicrosoftEmployee #AI #Infrastructure #LLMs #LLM
-
"At AMD, we've been finding new ways to advance Moore's Law by driving deeper integration." 🔥 In our on-demand episode AMD's Robert Hormuth and Supermicro's Tom Garvens discuss how server architecture is evolving to support artificial intelligence. Stream this episode to discover how to: ⚡Leverage server adaptability for seamless integration of diverse GPUs and specialized AI accelerators ⚡Implement design strategies that boost computational throughput and minimize latency ⚡Future-proof server architecture to stay ahead of evolving AI demands Stream here: https://okt.to/eNrkTc #Servers #artificialintelligence #Compute
-
AI takes A LOT of resources to run properly. That's why GPUs are so expensive. But what if you run the workloads on Kubernetes?
✅ You still need beefy systems
✅ GPUs still need to be powerful, but the per-workload footprint is much smaller
It's the same concept as when we went bare metal > VMs > containers. I created a blog post on running ML workloads on Kubernetes with deployKF. Link below 👇 https://lnkd.in/esrtk-sZ #kubernetes #devops #platformengineering
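To make the GPU-on-Kubernetes point concrete, here is a minimal sketch of submitting a pod that requests a single GPU via the official Kubernetes Python client. The image, pod name, and namespace are placeholders, and it assumes the NVIDIA device plugin is installed so nvidia.com/gpu is a schedulable resource; the blog post itself covers deployKF, this is just the underlying resource-request mechanics.

```python
# A minimal sketch of scheduling an ML workload that requests one GPU.
# Image, pod name, and namespace are placeholders; requires the NVIDIA
# device plugin so the "nvidia.com/gpu" resource exists on the nodes.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="ml-train-job"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="my-registry/ml-train:latest",        # placeholder image
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"},          # one GPU for this pod
                ),
            )
        ],
    ),
)

core.create_namespaced_pod(namespace="default", body=pod)
```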
-
📈 StackOS is empowering #DePINs with the decentralized compute power they need to build the future. If you want to learn more about @DeployOnStackOS, navigate to our issue #08 and delve deeper into this vibrant project! 🚀🔍 https://lnkd.in/gz6ByN_p (page 30)
-
NVIDIA Triton Inference Server is a powerful tool for deploying machine learning models in production environments and is well suited to running on Kubernetes. Learn how NGINX Plus Ingress Controller can provide secure external access, as well as load balancing, to a Kubernetes-hosted NVIDIA Triton Inference Server cluster! A minimal Ingress sketch follows the link below.
How I did it - "Securing Nvidia Triton Inference Server with NGINX Plus Ingress Controller" | DevCentral
community.f5.com
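For flavour, here is a minimal sketch of exposing a Triton Service through a standard Kubernetes Ingress with an NGINX ingress class, created via the official Python client. This is not the configuration from the F5 article (which covers NGINX Plus Ingress Controller specifics such as TLS and load-balancing policy); the host, Service name, namespace, and port are assumptions.

```python
# A minimal sketch of routing external traffic to a Triton Service via a
# standard Ingress resource handled by an NGINX ingress class.
# Host, Service name, namespace, and port are placeholders.
from kubernetes import client, config

config.load_kube_config()
networking = client.NetworkingV1Api()

ingress = client.V1Ingress(
    metadata=client.V1ObjectMeta(name="triton-ingress"),
    spec=client.V1IngressSpec(
        ingress_class_name="nginx",               # NGINX ingress controller class
        rules=[
            client.V1IngressRule(
                host="triton.example.com",        # placeholder hostname
                http=client.V1HTTPIngressRuleValue(
                    paths=[
                        client.V1HTTPIngressPath(
                            path="/",
                            path_type="Prefix",
                            backend=client.V1IngressBackend(
                                service=client.V1IngressServiceBackend(
                                    name="triton-svc",                      # assumed Service name
                                    port=client.V1ServiceBackendPort(number=8000),  # Triton HTTP port
                                )
                            ),
                        )
                    ]
                ),
            )
        ],
    ),
)

networking.create_namespaced_ingress(namespace="default", body=ingress)
```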