Roy Budhaditya’s Post

Engineering Leadership | Data & AI Engineering | Building Scalable, Modern Data & AI Platforms

It is very apparent that open LLMs are gaining popularity, and Kubernetes is the most scalable serving choice. Two serving frameworks support full Kubernetes deployment: KServe (a CNCF project) and Ray Serve. Additionally, NVIDIA Triton works extremely well as the underlying inference server for open LLMs.

I often get asked which one to use for Kubernetes inference, and the answer is: it depends. Rather than focusing on the framework, focus on custom load balancing and autoscaling of serving, gathering key performance metrics such as KV-cache utilization and other utilization indicators. Gathering LLM performance data and acting on it is the most critical part of MLOps.

#knative #kserve #rayserve #kuberay #nvidiatriton #kubernetes #mlops #openllm
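The autoscaling-on-serving-metrics idea above can be sketched as a Kubernetes HorizontalPodAutoscaler driven by a custom per-pod metric instead of CPU. This is a minimal sketch, not a production recipe: the deployment name `llm-server` and the metric name `kv_cache_utilization` are hypothetical, and it assumes your serving pods export such a gauge and that a metrics adapter (e.g. prometheus-adapter) surfaces it through the custom metrics API.

```yaml
# Sketch: scale an LLM serving deployment on KV-cache utilization.
# Assumptions: pods expose a gauge named kv_cache_utilization
# (hypothetical name; actual names vary by serving engine), and a
# custom-metrics adapter registers it with custom.metrics.k8s.io.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-server          # hypothetical deployment name
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: kv_cache_utilization   # assumed metric name
      target:
        type: AverageValue
        averageValue: "800m"         # scale out above ~80% average KV-cache use
```

The same pattern works for other serving signals (queue depth, tokens/s, time-to-first-token); the framework choice matters less than wiring these metrics into the scaler.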
