A Product Manager’s Take on LLMs
I’ve always been fascinated by AI, but let’s face it—training models never quite sparked my excitement. I love seeing the end results, but I’ve come to terms with the fact that the slow, methodical process of building those models just isn’t for me.
However, LLMs (Large Language Models) have reignited that AI curiosity I felt back in 2022. They’ve opened up doors for creating better products in ways that feel less like data science and more like product innovation.
I don’t claim to be an expert on the technical intricacies behind LLMs. My understanding doesn’t go much deeper than embeddings, and I’m not here to talk about what I can’t fully grasp. Instead, I’ll focus on how LLMs fit into the product landscape and how I think about their role in driving business outcomes.
Embracing Indeterminism
As a product manager, I’m used to thinking about predictable, reliable user experiences. You validate data, and from that point on, you trust it. You build your logic around clear inputs and outputs.
But LLMs break that mold. They introduce a level of unpredictability inside your product that’s a bit unsettling, especially when the product’s success hinges on consistency.
For example, when you call OpenAI’s API, you won’t get the same answer every time. You could ask the model for structured data like JSON, and it might give it to you—most of the time. Other times, it hallucinates and adds unexpected fields.
This isn’t a flaw; it’s just the nature of LLMs as they stand today. But as a product manager, you have to consider what this means for the reliability of your product’s core functionality. Can you trust an LLM’s response when it might vary each time?
Think of it this way: when querying a database or hitting an external service, failures typically mean you get no response or an error. With LLMs, it’s different. You might get a response, but its accuracy or structure can be inconsistent.
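To make this concrete, here's a minimal sketch of the kind of guardrail a team might wrap around a call like that, using the OpenAI Python SDK. The model name, the two-field schema, and the retry count are all illustrative: the point is to validate the structure you got back and re-ask rather than crash when it drifts.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

REQUIRED_FIELDS = {"title", "summary"}  # illustrative schema

def get_structured_summary(text: str, max_retries: int = 3) -> dict:
    """Ask the model for JSON, and re-ask if the structure is off."""
    prompt = (
        "Summarize the following text as JSON with exactly two keys, "
        f"'title' and 'summary':\n\n{text}"
    )
    for _ in range(max_retries):
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model name
            messages=[{"role": "user", "content": prompt}],
        )
        raw = response.choices[0].message.content
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON: retry instead of crashing
        # Reject hallucinated or missing fields rather than trusting the output
        if isinstance(data, dict) and set(data) == REQUIRED_FIELDS:
            return data
    raise ValueError(f"No valid response after {max_retries} attempts")
```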
Speed vs. Stability
The LLM space is evolving at breakneck speed. In fact, by the time I finish writing this, there will probably be new features available. OpenAI, for instance, now offers a response_format parameter that makes JSON responses more reliable, and a seed parameter to make outputs more deterministic and consistent across runs.
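For reference, those two options look roughly like this in the OpenAI Python SDK (the model name is illustrative). response_format requests JSON mode, and seed makes repeated identical requests more consistent, though OpenAI describes seeded sampling as best-effort rather than guaranteed.

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        {"role": "system", "content": "Reply only with JSON."},
        {"role": "user", "content": "Give me a product idea as JSON."},
    ],
    response_format={"type": "json_object"},  # JSON mode: output parses as JSON
    seed=42,  # best-effort determinism across identical requests
)
print(response.choices[0].message.content)
```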
This pace reminds me of the rapid growth we saw in JavaScript frameworks a few years ago. Take LangChain, a framework designed to create more complex workflows with LLMs: it helps you feed models with data, chain multiple models together, and create agents that refine questions. I remember reading the docs, and by the time I checked GitHub for an issue, the methods I was using had already been deprecated!
In product management, we like stable processes and well-defined roadmaps. But with LLMs, the landscape is too new for those foundations. You have to adapt to an ever-shifting field where the rules are being written in real time.
RAG: The Practical Approach
One emerging approach that makes a lot of sense for teams building LLM applications is Retrieval-Augmented Generation (RAG). This is a clever way of giving your model extra context without having to go through the slow process of training it on new data.
Here’s why it works: LLMs can only provide answers based on what they’ve been trained on. If they don’t know something, they’ll happily hallucinate a response. RAG addresses that by retrieving relevant documents (indexed as embedding vectors) and handing them to the model as extra context for its response.
Essentially, you’re augmenting the LLM’s capabilities by feeding it the right data at the right time, without needing to retrain anything. This gives you faster iteration cycles and allows your product to deliver more accurate results without the heavy lifting of model training.
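Here's a minimal sketch of that loop, assuming the OpenAI embeddings and chat APIs and a handful of in-memory documents. A real product would precompute the embeddings and store them in a vector database; the documents and model names here are invented for illustration.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

DOCS = [
    "Our refund policy allows returns within 30 days.",
    "Premium plans include priority support.",
    "The mobile app has supported offline mode since v2.3.",
]

def embed(texts: list[str]) -> np.ndarray:
    result = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in result.data])

DOC_VECTORS = embed(DOCS)  # in a real system: precomputed, stored in a vector DB

def answer(question: str, top_k: int = 2) -> str:
    # Retrieve: rank documents by cosine similarity to the question
    q = embed([question])[0]
    sims = DOC_VECTORS @ q / (np.linalg.norm(DOC_VECTORS, axis=1) * np.linalg.norm(q))
    context = "\n".join(DOCS[i] for i in np.argsort(sims)[::-1][:top_k])
    # Augment and generate: hand the retrieved context to the model
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```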
Training Models: A Bottleneck
Training custom models is slow. In most cases, small and mid-sized teams will find that training models just doesn’t make sense from a time or cost perspective. You can fine-tune GPT or train your own models, but the overhead—both in time and resources—can bog down progress.
Prompting an LLM or using a RAG pipeline, on the other hand, allows you to iterate rapidly. Change a prompt, and you’re ready to deploy after a single pull request. Retraining a model? That’s a far more complex process.
For most teams, the faster iteration cycles made possible through prompt engineering and RAG are far more valuable than the benefits of custom training.
Handling Errors & Testing in LLMs
LLMs introduce new layers of complexity when it comes to errors and testing. Every request is an API call, meaning network issues and timeouts are a constant risk. Beyond that, the variability in LLM responses can lead to unexpected issues.
For example, what do you do when an LLM fails halfway through a task? How do you handle retries? If the response is cut off or the model hallucinates data that breaks your logic, do you have mechanisms in place to roll back any changes?
Testing LLM-driven features is also uncharted territory. Traditional testing frameworks expect predictable outputs, but LLMs are inherently non-deterministic. So how do you ensure a prompt will keep producing acceptable results? Many teams are now vectorizing their ideal responses and comparing them to actual outputs using cosine similarity, which is very different from traditional unit testing.
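Here's a rough sketch of that style of check, again assuming the OpenAI embeddings API. The run_refund_prompt helper and the 0.9 threshold are invented for illustration; in practice each team has to tune the threshold against its own prompts.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embedding(text: str) -> np.ndarray:
    result = client.embeddings.create(model="text-embedding-3-small", input=[text])
    return np.array(result.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def test_refund_prompt():
    ideal = "Customers can return items within 30 days for a full refund."
    actual = run_refund_prompt()  # hypothetical: calls the LLM feature under test
    # Pass when the meaning is close enough, even if the wording differs
    assert cosine_similarity(embedding(ideal), embedding(actual)) > 0.9
```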
Streaming Responses & User Experience
From a product perspective, user experience is paramount. And when dealing with LLMs, waiting for the entire response to load before showing anything isn’t always ideal.
We’ve gotten used to streaming word-by-word responses, much like OpenAI does with their chat interface. But when you’re streaming structured data like JSON, things get tricky. The response arrives token by token, and until the structure is complete, you’re left with broken data that can lead to parsing errors.
In these cases, you have to get creative. For example, I’ve had to use libraries that optimistically close the JSON to avoid errors, but even then, handling more complex data structures like arrays becomes challenging.
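To show what "optimistically closing" means, here's a simplified, hand-rolled version of the trick (a sketch, not any particular library): it appends whatever closing quotes and brackets a truncated JSON string is missing, so each streamed chunk can be parsed early. It deliberately ignores edge cases, such as a value cut off right after a colon, which is exactly where arrays and nested structures get painful.

```python
import json

def close_partial_json(partial: str) -> str:
    """Append the closing quotes and brackets a truncated JSON string is missing."""
    stack = []
    in_string = False
    escaped = False
    for ch in partial:
        if escaped:
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == '"':
            in_string = not in_string
        elif not in_string:
            if ch in "{[":
                stack.append("}" if ch == "{" else "]")
            elif ch in "}]":
                stack.pop()
    closer = '"' if in_string else ""  # close a string cut off mid-value
    return partial + closer + "".join(reversed(stack))

# Each streamed chunk can now be parsed without waiting for the full response
chunk = '{"items": [{"name": "First'
print(json.loads(close_partial_json(chunk)))  # {'items': [{'name': 'First'}]}
```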
LLMs: Strengths and Weaknesses
At first, I fell for the hype, thinking LLMs could handle any computational task. I quickly learned that they’re terrible at basic math. Ask one to count the items in an array, and you’ll get an approximation, not an accurate result.
But what LLMs are good at is understanding intent. I had a use case where users needed to submit plain English commands that would generate UI elements. While the LLM was good at generating objects based on schemas, it frequently hallucinated fields that broke the UI.
I realized that the value of LLMs isn’t in generating the structured data itself—it’s in mapping user intent to functions in your code. This approach, called function calling, is becoming more common, allowing LLMs to act as interpreters that trigger specific actions in your application rather than relying on them for perfect outputs.
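A bare-bones sketch of that pattern using the OpenAI tools API; the create_button function and its schema are invented for illustration. Note that the model never produces the UI object itself: it only decides which function to call and with what arguments, and your own code builds the element.

```python
import json
from openai import OpenAI

client = OpenAI()

def create_button(label: str) -> dict:
    # Your code, not the model, builds the actual UI element
    return {"type": "button", "label": label}

TOOLS = [{
    "type": "function",
    "function": {
        "name": "create_button",
        "description": "Add a button to the page",
        "parameters": {
            "type": "object",
            "properties": {"label": {"type": "string"}},
            "required": ["label"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Add a Save button"}],
    tools=TOOLS,
)

# The model maps intent to a function call; we execute it ourselves
call = response.choices[0].message.tool_calls[0]
if call.function.name == "create_button":
    element = create_button(**json.loads(call.function.arguments))
    print(element)  # {'type': 'button', 'label': 'Save'}
```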
The Future of Prompt Engineering
I see prompt engineering becoming a fundamental skill, not a specialized discipline. Much like testing has become part of the everyday workflow for engineers, prompt engineering will become a core competency for product teams.
There may be a few specialists in the future who focus entirely on refining prompts and optimizing models, but for most teams, it will be an additional tool in their toolkit. Writing effective prompts will live alongside writing tests and other best practices in product development.