Natural Language Understanding and Conversational AI
Natural Language Understanding (NLU), together with Natural Language Generation (NLG), is part of the wider field of Natural Language Processing (NLP). NLU is a key Artificial Intelligence (AI) application for Conversational AI (CAI) that allows virtual agents and voicebots/chatbots to understand users.
There are two key aspects to understanding users properly:
- Their intent: what does the user want?
- Entities or slots: the relevant data within the user's query
You will find these two notions in every Conversational AI system, although implementations differ considerably in detail. Important aspects of intent recognition include:
- How can a virtual agent provide hundreds of answers accurately?
- How can intents be “trained”?
- How can intent models be tested and validated?
- How can the model results from production be monitored?
- How can the model be improved over time?
All major CAI platforms have machine learning based capabilities to detect intents in user input. Many platforms have additional capabilities on top of “simple” intent recognition models.
A major drawback of "simple" intent recognition models is that beyond roughly 100-200 intents they tend to struggle to accurately identify what a user is asking for – even with large amounts of training data. At the same time, many products offer additional capabilities:
- At the very least you can use entities/slots to further distinguish the user's intent. Depending on a platform's entity-related capabilities, this might still limit how many different answers a virtual agent can provide for specific user utterances without additional dialogue.
- Some platforms allow you to customize the NLP pipeline or work with classifier features, giving more direct control over the machine learning model. This can be very powerful – but it requires a specialized skillset from the team training the model.
- Several platforms have added capabilities such as synonyms or knowledge graphs to support their intent classification.
- A few platforms support hierarchical intents: ambiguous utterances are placed higher up in the intent hierarchy, while very specific utterances are placed in specific child intents. Successful implementations have scaled to hundreds or even thousands of "fine-grained" intents this way. This notion of an intent differs somewhat from that of other platforms.
- Some platforms allow an additional machine learning based classification step to further distinguish the user's intent beyond a coarse-grained "first-level" intent (a minimal sketch follows this list). If not supported natively, this can always be done via API integration. The more it is needed, though, the more effort it takes to build and maintain, and it works best when a simple model is applied across virtual agents.
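To illustrate the two-stage idea, here is a minimal Python sketch using scikit-learn: a coarse classifier picks the first-level intent, then a dedicated sub-classifier refines it. All intent names and training utterances are made up for the example; this is a sketch, not any platform's implementation.

```python
# Two-stage intent classification: coarse intent first, then a
# per-intent sub-classifier for the fine-grained distinction.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

coarse_data = [  # (utterance, first-level intent) - invented examples
    ("I want a fee refunded", "fee_refund"),
    ("refund the charge please", "fee_refund"),
    ("move money between my accounts", "money_transfer"),
    ("send 50 dollars to checking", "money_transfer"),
]
fine_data = {  # sub-classifier training data per coarse intent
    "fee_refund": [
        ("refund my overdraft fee", "overdraft"),
        ("refund the monthly maintenance fee", "maintenance"),
    ],
}

def train(pairs):
    texts, labels = zip(*pairs)
    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(texts, labels)
    return model

coarse_model = train(coarse_data)
fine_models = {intent: train(pairs) for intent, pairs in fine_data.items()}

def classify(utterance):
    coarse = coarse_model.predict([utterance])[0]
    sub = fine_models.get(coarse)
    fine = sub.predict([utterance])[0] if sub else None
    return coarse, fine

print(classify("please refund my overdraft fee"))
# expected: ('fee_refund', 'overdraft')
```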
Capabilities to capture entities in CAI platforms differ a lot:
- Predefined entities – often called system entities – tend to cover concepts like dates, times, and numbers. Note: these entities might not always perform as needed.
- User-defined entities can be defined with values and synonyms for those values – e.g. the value "platinum" with synonyms "premium", "best level", etc.
- Entities can be defined using so-called regular expressions, e.g. to capture a certain number of digits or more complex patterns such as an email address (see the sketch after this list).
- More advanced capabilities are not present in all platforms; options include:
  - The ability to expand or restrict entities based on existing entities
  - The ability to train a machine learning based model to capture entity values or slots from annotated (labeled) data. This can be complemented by a slot mapping capability to normalize values extracted from user input.
  - The ability to define composite entities – combinations of entities expected to appear together in user utterances, e.g. a food item together with modifiers and a number, as in "two large cheeseburgers with extra mayo".
Let’s take a look at a few retail banking inspired examples:
- “Fee refund”: intent “fee_refund”
- “Overdraft fee refund”: intent “overdraftfee_refund” or intent “fee_refund”, entity “overdraft”
- “I think I have money in my savings account, can you transfer $100 from there to my checking account tomorrow”: intent “money_transfer”, entities: from_account = “savings”, to_account = “checking”, date = “tomorrow” – to be provided in a normalized format
The first two examples outline the difference between a hierarchical approach, where the second utterance would be captured by a specific child intent of a more generic fee refund intent, and an approach where a broader "fee_refund" intent works together with entity capture to understand which kind of fee the customer wants refunded.
The third example can only be fully addressed with machine learning based entities, in order to capture that the source account is the savings account. This is currently not possible in a number of platforms.
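For illustration, a normalized NLU result for the third example might look like the following. The schema (intent/entities keys) and the amount entity are assumptions for this sketch, not any particular platform's format; "tomorrow" is resolved relative to today's date.

```python
# A hypothetical normalized NLU result for the money transfer utterance.
from datetime import date, timedelta

utterance = ("I think I have money in my savings account, can you transfer "
             "$100 from there to my checking account tomorrow")

nlu_result = {
    "intent": "money_transfer",
    "entities": {
        "from_account": "savings",
        "to_account": "checking",
        "amount": {"value": 100, "currency": "USD"},  # assumed entity
        "date": (date.today() + timedelta(days=1)).isoformat(),  # normalized
    },
}
print(nlu_result)
```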
Additional NLU aspects come into play for multi-turn conversations:
- Targeted classification at subsequent conversation turns. This can go as far as a specific intent machine learning model for every dialog turn. In its simplest form this is about yes/no questions, where we want to understand all possible variations of saying yes and no. The right interpretation may even depend on the question we asked, so it helps if a general mechanism can be extended per turn (a minimal sketch follows this list).
- Interruptions/deviations/context switches: A user might start a money transfer flow and then ask for an account balance before continuing the transfer flow.
- It is critical to support targeted interruptions while still giving the ongoing flow priority over a possible interruption/deviation.
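Here is a minimal sketch of targeted per-turn classification, assuming a hypothetical dialog engine that knows which question was just asked: a confirmation turn only needs a robust yes/no matcher, and anything else is routed to the main NLU model as a potential interruption.

```python
# Per-turn targeted classification for a confirmation question.
# The yes/no vocabularies are illustrative and far from exhaustive.
import re

YES = {"yes", "yeah", "yep", "sure", "correct", "ok", "right"}
NO = {"no", "nope", "nah", "cancel", "negative", "wrong"}

def classify_turn(utterance, expected="confirmation"):
    tokens = set(re.findall(r"[a-z']+", utterance.lower()))
    if expected == "confirmation":
        if tokens & YES:
            return ("confirm", None)
        if tokens & NO:
            return ("deny", None)
    # Anything else may be an interruption/context switch:
    # hand the utterance to the main intent model.
    return ("interruption", "route_to_main_nlu")

print(classify_turn("yeah go ahead"))              # ('confirm', None)
print(classify_turn("what's my account balance"))  # ('interruption', ...)
```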
Ideally CAI platforms support both aspects. In practice, several platforms lack full built-in support. Platforms try to fully automate the handling of these aspects, but user behavior is too varied for current approaches to work well. This creates the need to design every conversation turn with this in mind. A few platforms support this directly and try to make it easy; in other platforms it can mean significant implementation and maintenance effort.
Other approaches to NLU for CAI
- Knowledge graphs have gained some popularity in recent years. If existing content is well structured and unambiguous, this can be a very interesting alternative to the traditional approach with intents and entities.
- Training advanced NLU/NLG models – e.g. so-called sequence-to-sequence models – to directly generate an answer to user input without explicit intent/entity/dialog management. As with other automatic NLG methods, adoption is primarily inhibited by not knowing upfront exactly what the model (virtual agent) will say. In regulated industries especially this is a no-go, and even without regulatory restrictions there is a big risk of reputational damage from automated responses.
- This is an area of ongoing research, with e.g. Google's LaMDA and Facebook's Blender model able to hold impressive conversations – while still producing unwanted output at times.
- Platforms/products try to answer (FAQ) questions based on documents and lists. Without the ability to control in detail how user input is classified and treated, these approaches struggle to provide significant business value (a minimal retrieval sketch follows this list).
- With restrictions placed on documents and their structure, this can be a way to provide answers out of a knowledge base – though not to build flows or provide service in the channel.
- Other approaches, from the Artificial Intelligence Markup Language (AIML) to language grammars, tend to struggle to handle the full range of user utterances well.
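To make the FAQ point concrete, here is a minimal retrieval sketch with a similarity threshold, so that low-confidence matches are rejected instead of answered poorly. The FAQ entries and threshold value are illustrative assumptions.

```python
# Document-based FAQ answering via TF-IDF similarity with a cutoff.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

faq = {  # invented knowledge base entries
    "How do I reset my online banking password?": "Use the 'Forgot password' link on the login page.",
    "What are the wire transfer fees?": "Outgoing domestic wires cost a fixed fee per transfer.",
}
questions = list(faq)
vectorizer = TfidfVectorizer().fit(questions)
question_vecs = vectorizer.transform(questions)

def answer(user_question, threshold=0.3):
    sims = cosine_similarity(vectorizer.transform([user_question]), question_vecs)[0]
    best = sims.argmax()
    if sims[best] < threshold:
        return None  # likely out of scope: hand off instead of guessing
    return faq[questions[best]]

print(answer("how can I reset my password"))
```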
Evaluating and Improving NLU for CAI
To properly evaluate an NLU model, one needs an independent test or validation data set with a reasonable number of varied utterances. It should include edge cases – testing the "borders" of intents – and negative test cases (utterances that are out of scope). The majority of platforms offer some integrated testing capabilities; very few offer advanced capabilities such as a confusion matrix for intents.
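As an illustration, a confusion matrix for intents takes only a few lines of Python once you have test utterances with expected and predicted intents; the labels below are made-up placeholders for whatever your platform's test run returns.

```python
# Evaluating intent predictions against an independent test set.
from sklearn.metrics import confusion_matrix, classification_report

intents = ["fee_refund", "money_transfer", "out_of_scope"]
y_true = ["fee_refund", "fee_refund", "money_transfer", "out_of_scope"]
y_pred = ["fee_refund", "money_transfer", "money_transfer", "fee_refund"]

print(confusion_matrix(y_true, y_pred, labels=intents))
print(classification_report(y_true, y_pred, labels=intents, zero_division=0))
```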
Note: When looking at negative or out-of-scope test cases, the accuracy of NLU models differs drastically across major vendors. In one test scenario I ran, accuracy on out-of-scope tests ranged from 23% to 65% across leading providers.
Ongoing production monitoring presents another challenge: "hard" performance metrics can only be created manually by someone who knows the full scope of the NLU model. This is time-consuming and needed on an ongoing basis. Teams typically opt for a strategy where a random sample is analyzed at regular intervals – or alternatively rely on indirect measurements.
Indirect measurements include feedback from users, whether flows kicked off by user utterances produce a business outcome, and flows that are interrupted right after starting and never finish.
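A sketch of computing two such indirect measurements from conversation logs; the log schema here is an assumption for illustration.

```python
# Flow completion rate and early-abandonment rate from hypothetical logs.
logs = [
    {"flow": "money_transfer", "completed": True,  "turns": 6},
    {"flow": "money_transfer", "completed": False, "turns": 1},  # abandoned early
    {"flow": "fee_refund",     "completed": False, "turns": 1},
]

total = len(logs)
completion_rate = sum(l["completed"] for l in logs) / total
early_abandon_rate = sum(
    1 for l in logs if not l["completed"] and l["turns"] <= 1
) / total
print(f"completed: {completion_rate:.0%}, abandoned at turn 1: {early_abandon_rate:.0%}")
```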
The hardest challenge is addressing utterances that are out of scope but still fire an intent. Indirect measurements as just described can help identify them – but the only true measure is the result of manual analysis.
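One common mitigation is a confidence threshold: reject predictions whose confidence falls below a tuned cutoff rather than firing the intent. A minimal sketch, with an illustrative threshold value:

```python
# Reject low-confidence intent predictions to reduce out-of-scope hits.
def resolve_intent(scores, threshold=0.7):
    """scores: dict mapping intent name -> model confidence in [0, 1]."""
    intent, confidence = max(scores.items(), key=lambda kv: kv[1])
    if confidence < threshold:
        return "fallback"  # ask for clarification / hand off to a human
    return intent

print(resolve_intent({"fee_refund": 0.41, "money_transfer": 0.38}))  # fallback
print(resolve_intent({"fee_refund": 0.92, "money_transfer": 0.05}))  # fee_refund
```

The threshold should be tuned against out-of-scope test data, since too high a value starts rejecting legitimate in-scope utterances.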
Improving the model starts with the evaluation data just discussed. The key aspects to look at:
- Which intents have low performance metrics, e.g. overlap with other intents, and how to address that
- Which topics users frequently ask about that are not yet covered: most platforms have some support for this by now; some also support doing it upfront when creating the first NLU model (see the clustering sketch after this list)
- How to make sure out-of-scope utterances do not fire off flows
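One simple way to surface uncovered topics is to cluster utterances the model could not match. The sketch below uses TF-IDF and k-means as a stand-in for the embedding-based clustering some platforms ship; the utterances are invented.

```python
# Grouping unmatched utterances to reveal candidate new intents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

unmatched = [
    "can I increase my credit card limit",
    "raise the limit on my credit card",
    "where can I buy foreign currency",
    "I need some foreign currency for my trip",
]
vectors = TfidfVectorizer().fit_transform(unmatched)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for cluster, utterance in sorted(zip(labels, unmatched)):
    print(cluster, utterance)
```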
Context is key for good CAI flows/answers – it’s not just NLU
So far, we have focused on extracting as much as possible from the user's input or utterance. A critical platform capability is the use of context, which is primarily about two things:
- Things we know about the user. If a user asks the ambiguous question "what's my balance", knowing that the user only has a checking account allows us to answer directly. For a user with both a checking and a savings account we need to disambiguate – or design a flow that returns both balances.
- Things we know from the conversation. E.g. a user might ask "what's the balance of my savings account" followed by "transfer $100 from there to my checking account". "There" can only be resolved from conversation context to make this a transfer from the user's savings account to his or her checking account.
Current platforms support the first notion of context via API integration – typically with significant initial and ongoing effort. The second kind of context is supported in principle by some platforms. Unfortunately, automated approaches tend to break in practice by either assuming too much (e.g. keeping context for too long) or too little (not picking up on key context provided).
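As a minimal sketch of the second kind of context, the following resolves "there" against a stored conversation context. The context store and resolution rule are assumptions for illustration, not a specific platform mechanism.

```python
# Carrying an account mention across turns to resolve "there".
context = {}  # filled as the conversation progresses

def handle_turn(utterance, entities):
    # Remember the most recently mentioned account for later turns.
    if "account" in entities:
        context["last_account"] = entities["account"]
    # Resolve "there" against the stored context.
    if entities.get("from_account") in (None, "there") and "there" in utterance:
        entities["from_account"] = context.get("last_account")
    return entities

handle_turn("what's the balance of my savings account", {"account": "savings"})
print(handle_turn("transfer $100 from there to my checking account",
                  {"from_account": "there", "to_account": "checking"}))
# {'from_account': 'savings', 'to_account': 'checking'}
```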