Jawad Amin’s Post

Jawad Amin

AI @ Microsoft | ex-Apple | uWaterloo alum

As organizations think about Generative AI at scale, the biggest challenge I see teams facing is optimizing these three aspects in tandem:

- Cost
- Performance
- Quality

The problem is that it’s impossible to optimize each variable independently. If you want higher-quality responses, you may need multi-turn LLM loops, which cost more and make inference slower. On the other hand, if you want cheaper inference, you may have to go with a smaller model that gives lower response quality, or live with “noisy neighbours” (https://lnkd.in/gTkbyNv9), which hurts performance.

So how do we optimize for the three nodes of this inference triangle? You can’t optimize what you can’t measure, so as you think about moving GenAI workloads to production, the first step is to capture metrics (a quick sketch follows after this list):

- Request and token volumes, for cost
- Time-to-first-token (TTFT) and tokens per second (TPS), for performance
- Evaluation metrics such as accuracy, groundedness, relevance, and F1 scores, for quality

Typically, when there is a breakthrough, it comes along at least two of these three dimensions. For example, GPT-4-Turbo is both higher quality and cheaper, and the Groq LPU architecture is both cheaper and faster. It is important to have good measurements for each dimension so you can fully take advantage of GenAI breakthroughs as they come.
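The post itself has no code, but the measurement step is easy to prototype. Here is a minimal, provider-agnostic sketch in Python: `measure_stream`, `InferenceMetrics`, and the fake stream are illustrative names of my own (not from the post or any particular SDK), and the whitespace token counter stands in for a real tokenizer such as tiktoken. It wraps any streaming iterator of text chunks and records TTFT, decode throughput, and completion token volume, covering the performance and cost metrics above.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple

@dataclass
class InferenceMetrics:
    completion_tokens: int   # token volume (feeds cost tracking)
    ttft_s: float            # time-to-first-token, in seconds
    tokens_per_s: float      # decode throughput after the first token

def measure_stream(
    chunks: Iterable[str],
    count_tokens: Callable[[str], int],
) -> Tuple[str, InferenceMetrics]:
    """Consume a streaming completion and record latency/throughput.

    `chunks` is any iterable of text deltas (e.g. from a streaming
    chat-completions call); `count_tokens` is your tokenizer's counter.
    """
    start = time.monotonic()
    first_token_at = None
    pieces = []
    for chunk in chunks:
        if first_token_at is None and chunk:
            first_token_at = time.monotonic()  # first visible output
        pieces.append(chunk)
    end = time.monotonic()

    text = "".join(pieces)
    n_tokens = count_tokens(text)
    ttft = (first_token_at or end) - start
    # Throughput over the decode phase only (after the first token);
    # the epsilon guards against division by zero on tiny streams.
    decode_time = max(end - (first_token_at or start), 1e-9)
    return text, InferenceMetrics(n_tokens, ttft, n_tokens / decode_time)

if __name__ == "__main__":
    # Fake stream for demonstration; swap in a real provider iterator.
    def fake_stream():
        for word in ["The ", "inference ", "triangle ", "has ", "three ", "nodes."]:
            time.sleep(0.05)  # simulate network/decode latency
            yield word

    text, m = measure_stream(fake_stream(), count_tokens=lambda s: len(s.split()))
    print(text)
    print(f"TTFT: {m.ttft_s*1000:.0f} ms  TPS: {m.tokens_per_s:.1f}  tokens: {m.completion_tokens}")
```

On the cost side, multiplying the recorded prompt and completion token counts by your provider's per-token prices gives a running spend estimate per request; quality metrics like groundedness or F1 come from a separate offline evaluation loop.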

[Diagram: the inference triangle of cost, performance, and quality]
Mark Persaud

AI Leader | Ex Chief Data Officer | Board Member | Speaker | Polymath | Martial Artist | Photographer
I help clients innovate to realize new and differentiated possibilities for their customers.


Nice post, Jawad Amin. You’ve covered the topic of cost-aware model selection very well. What do you think about other options like these?

- Fine-tuning models on more domain-specific data
- Using smaller datasets
- Caching frequently used responses (see the sketch below)
- Breaking large documents into smaller chunks while managing token-size limits
- Using abstractive or extractive summarization techniques
- Monitoring and optimizing LLM usage and performance
- Optimizing resource allocation and auto-scaling based on demand
- Regularly monitoring and reviewing models for ethical guideline compliance
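Of those options, caching frequently used responses is straightforward to prototype. Below is a minimal exact-match LRU sketch in Python; `ResponseCache` and the `llm_call` parameter are illustrative names of my own, not from the post or any specific library.

```python
import hashlib
from collections import OrderedDict
from typing import Callable

class ResponseCache:
    """Tiny LRU cache keyed on (model, prompt), exact-match only."""

    def __init__(self, max_entries: int = 1024):
        self._store = OrderedDict()  # maps key -> cached response text
        self._max = max_entries

    def _key(self, model: str, prompt: str) -> str:
        # Hash the pair so keys stay small regardless of prompt length.
        return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

    def get_or_call(
        self,
        model: str,
        prompt: str,
        llm_call: Callable[[str, str], str],
    ) -> str:
        k = self._key(model, prompt)
        if k in self._store:
            self._store.move_to_end(k)      # cache hit: no tokens billed
            return self._store[k]
        response = llm_call(model, prompt)  # cache miss: pay for inference
        self._store[k] = response
        if len(self._store) > self._max:
            self._store.popitem(last=False)  # evict least recently used
        return response
```

Note the limitation: exact-match caching only helps with repeated identical prompts, so production systems often layer on semantic caching keyed on embedding similarity to catch paraphrases as well.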

