Open LLMs are clearly gaining popularity, and Kubernetes remains the most scalable way to serve them. Two serving frameworks that support full Kubernetes deployment stand out: KServe (a CNCF project) and Ray Serve. In addition, NVIDIA Triton Inference Server serves open LLMs extremely well as the underlying inference engine. #knative #kserve #rayserve #kuberay #nvidiatriton I often get asked which one to use for Kubernetes inference, and the honest answer is that it depends. Rather than fixating on the framework, focus on custom load balancing and autoscaling of serving, and on gathering key performance metrics such as KV cache utilization and other resource-utilization signals. Collecting LLM performance data and acting on it is the most critical part of MLOps. #kubernetes #mlops #openllm
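To make the metrics-first point concrete, here is a minimal sketch of polling a serving endpoint's Prometheus-style /metrics output for KV cache utilization. The service address and the metric name (following vLLM's naming) are assumptions; substitute whatever gauge your serving stack actually publishes.

```python
# A minimal sketch of polling an LLM server's Prometheus-style /metrics endpoint
# and extracting a KV-cache utilization gauge to inform scaling decisions.
# The endpoint URL and metric name are assumptions, not a specific deployment.
import requests

METRICS_URL = "http://llm-service.default.svc:8000/metrics"  # assumed service address
KV_CACHE_METRIC = "vllm:gpu_cache_usage_perc"                # assumed metric name

def kv_cache_utilization() -> float:
    """Return the KV-cache usage fraction reported by the server."""
    text = requests.get(METRICS_URL, timeout=5).text
    for line in text.splitlines():
        if line.startswith(KV_CACHE_METRIC):
            # Prometheus exposition format: "<name>{labels} <value>"
            return float(line.rsplit(" ", 1)[-1])
    raise RuntimeError(f"metric {KV_CACHE_METRIC} not found")

if __name__ == "__main__":
    usage = kv_cache_utilization()
    # Example policy: flag scale-out when the cache is more than 80% full.
    print(f"KV cache usage: {usage:.2%}", "-> scale out" if usage > 0.8 else "-> ok")
```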
Roy Budhaditya’s Post
More Relevant Posts
-
Quora modernises model serving with NVIDIA Triton on Amazon EKS. This blog post explains how the leading Q&A platform modernised its model-serving architecture with NVIDIA Triton Inference Server on Amazon EKS, detailing the design decisions that reduced model-serving latency by 3x and model-serving cost by 25%.
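For readers who have not used Triton directly, here is a minimal client-side sketch using the official tritonclient package; the endpoint, model name, and tensor names are placeholders, not the ones from Quora's deployment.

```python
# A minimal sketch of calling a Triton Inference Server over HTTP with the
# official tritonclient package. Model name, tensor names, and shapes are
# placeholders; real deployments take them from the model's config.pbtxt.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")  # assumed endpoint

# Build the input tensor the server expects (placeholder name/shape/dtype).
data = np.random.rand(1, 4).astype(np.float32)
infer_input = httpclient.InferInput("INPUT__0", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

# Run inference against a placeholder model name and read back the output tensor.
result = client.infer(model_name="example_model", inputs=[infer_input])
print(result.as_numpy("OUTPUT__0"))
```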
-
A lot of customers I speak to are interested in GenAI for software development. They want to enable their #devs to spend their time coding and improve productivity. The challenge, however, is how to do this in a secure, controlled, and legal manner. One option is running an #LLM on-premises, such as #CodeLlama from #Meta, or a packaged solution like Codeium Enterprise from Codeium. Dell and Codeium have worked together to establish affordable hardware configurations for running Codeium Enterprise on-premises, entirely air-gapped, to keep your intellectual property and data secure. For a feel of what local inference looks like in practice, a minimal sketch follows the link below. #GenAI #AI #DevOps Learn more about this collaborative solution: https://lnkd.in/eMNBuA3D
Solution Brief–Codeium Enterprise on Dell Infrastructure
infohub.delltechnologies.com
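As a rough illustration of the on-prem option (not the Codeium/Dell solution itself), here is a minimal sketch of running a Code Llama checkpoint locally with Hugging Face Transformers. The checkpoint name is an assumption, and in a truly air-gapped setup the weights would be mirrored internally before use.

```python
# A minimal sketch of on-premises code completion with Hugging Face Transformers.
# The checkpoint name is an assumption; any locally mirrored Code Llama (or similar)
# weights work the same way once downloaded inside the air-gapped environment.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "codellama/CodeLlama-7b-hf"  # assumed checkpoint; pre-download for air-gapped use

tokenizer = AutoTokenizer.from_pretrained(MODEL)
# device_map="auto" spreads the model across available GPUs (requires accelerate).
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```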
-
Came across this fascinating blog from 2021 where OpenAI scaled their Kubernetes clusters to an incredible 7,500 nodes! It's amazing to see how they handled massive workloads for their LLM models, optimizing both hardware and network efficiency. This cluster supported large-scale LLM models for rapid research and required unique solutions like transitioning from Flannel to native pod networking for better throughput, leveraging NVIDIA GPUs, and using EndpointSlices to handle API server load.
Key Highlights: ⬇️
• Networking: Switched to native pod networking to handle around 200,000 IPs, ensuring high throughput.
• Resource Management: Tools like the "team-resource-manager" dynamically allocated nodes, while GPU health checks kept everything running smoothly.
• API Server Optimization: Using EndpointSlices reduced API load, maintaining cluster performance at scale (a small sketch of listing EndpointSlices follows the link below).
Matt Rickard, ex-Googler who worked on Kubernetes and was one of the Kubeflow maintainers, posted a few observations about why he thinks OpenAI prefers K8s over HPC frameworks and what makes this K8s use case special. This detailed journey is a goldmine for anyone working on large-scale Kubernetes deployments! 💻
👉 Full blog here https://lnkd.in/gSjFkjZy
#Kubernetes #CloudNative #Scalability #OpenAI #TechInsights
Scaling Kubernetes to 7,500 nodes
openai.com
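As a small hands-on companion to the EndpointSlices point (this is not OpenAI's tooling, just a generic sketch with an assumed namespace), here is how you could list a Service's EndpointSlices with the official Kubernetes Python client.

```python
# A minimal sketch of inspecting EndpointSlices with the official Kubernetes
# Python client, showing how a Service's endpoints are split into smaller
# slices instead of one giant Endpoints object.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside a pod
discovery = client.DiscoveryV1Api()

# "default" namespace is an assumption; point this at your own Service's namespace.
slices = discovery.list_namespaced_endpoint_slice(namespace="default")
for s in slices.items:
    owner = (s.metadata.labels or {}).get("kubernetes.io/service-name", "<unknown>")
    n_endpoints = len(s.endpoints or [])
    print(f"{s.metadata.name}: service={owner}, endpoints={n_endpoints}")
```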
-
As much fun as it is to use LLMs, I was curious to learn more about the hardware we use behind the scenes to create such innovative models. I got to take a moment to explore the infrastructure world while at #MSIgnite! 👩🏾💻 I met with Alistair Speirs, who gave me a tour of one of our AI servers, including a look at a GPU! #MicrosoftEmployee #AI #Infrastructure #LLMs #LLM
-
"At AMD, we've been finding new ways to advance Moore's Law by driving deeper integration." 🔥 In our on-demand episode AMD's Robert Hormuth and Supermicro's Tom Garvens discuss how server architecture is evolving to support artificial intelligence. Stream this episode to discover how to: ⚡Leverage server adaptability for seamless integration of diverse GPUs and specialized AI accelerators ⚡Implement design strategies that boost computational throughput and minimize latency ⚡Future-proof server architecture to stay ahead of evolving AI demands Stream here: https://okt.to/eNrkTc #Servers #artificialintelligence #Compute
-
AI takes A LOT of resources to run properly. That's why GPUs are so expensive. But what if you run the workloads on Kubernetes?
✅ You still need beefy systems
✅ GPUs still need to be powerful, but the per-workload footprint is much smaller
It's the same concept as when we went bare metal > VMs > containers. I created a blog post on running ML workloads on Kubernetes with deployKF. Link below 👇 https://lnkd.in/esrtk-sZ #kubernetes #devops #platformengineering
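To make the GPU-on-Kubernetes point concrete, here is a minimal sketch of submitting a pod that requests a single GPU via the official Kubernetes Python client. The image, pod name, and namespace are placeholders, and it assumes the NVIDIA device plugin is installed so nvidia.com/gpu is a schedulable resource; the blog post itself covers deployKF, this is just the underlying resource-request mechanics.

```python
# A minimal sketch of scheduling an ML workload that requests one GPU.
# Image, pod name, and namespace are placeholders; requires the NVIDIA
# device plugin so the "nvidia.com/gpu" resource exists on the nodes.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="ml-train-job"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="my-registry/ml-train:latest",        # placeholder image
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"},          # one GPU for this pod
                ),
            )
        ],
    ),
)

core.create_namespaced_pod(namespace="default", body=pod)
```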
-
📈 StackOS is empowering #DePINs with the decentralized compute power they need to build the future. If you want to learn more about @DeployOnStackOS, navigate to our issue #08 and delve deeper into this vibrant project! 🚀🔍 https://lnkd.in/gz6ByN_p (page 30)
-
NVIDIA Triton Inference Server is a powerful tool for deploying machine learning models in production environments and is well suited to running on Kubernetes. Learn how NGINX Plus Ingress Controller can provide secure external access, as well as load balancing, to a Kubernetes-hosted NVIDIA Triton Inference Server cluster! A minimal Ingress sketch follows the link below.
How I did it - "Securing Nvidia Triton Inference Server with NGINX Plus Ingress Controller" | DevCentral
community.f5.com
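For flavour, here is a minimal sketch of exposing a Triton Service through a standard Kubernetes Ingress with an NGINX ingress class, created via the official Python client. This is not the configuration from the F5 article (which covers NGINX Plus Ingress Controller specifics such as TLS and load-balancing policy); the host, Service name, namespace, and port are assumptions.

```python
# A minimal sketch of routing external traffic to a Triton Service via a
# standard Ingress resource handled by an NGINX ingress class.
# Host, Service name, namespace, and port are placeholders.
from kubernetes import client, config

config.load_kube_config()
networking = client.NetworkingV1Api()

ingress = client.V1Ingress(
    metadata=client.V1ObjectMeta(name="triton-ingress"),
    spec=client.V1IngressSpec(
        ingress_class_name="nginx",               # NGINX ingress controller class
        rules=[
            client.V1IngressRule(
                host="triton.example.com",        # placeholder hostname
                http=client.V1HTTPIngressRuleValue(
                    paths=[
                        client.V1HTTPIngressPath(
                            path="/",
                            path_type="Prefix",
                            backend=client.V1IngressBackend(
                                service=client.V1IngressServiceBackend(
                                    name="triton-svc",                      # assumed Service name
                                    port=client.V1ServiceBackendPort(number=8000),  # Triton HTTP port
                                )
                            ),
                        )
                    ]
                ),
            )
        ],
    ),
)

networking.create_namespaced_ingress(namespace="default", body=ingress)
```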