Five Ways to Address the Alignment Problem
Introduction
Enterprises are rushing to turn experiments with GenAI-powered systems into large-scale deployments that create tangible business impact. Unlike traditional systems, the current wave of AI carries additional risks that enterprises have not encountered before. IT systems that make up facts? This is certainly new territory, and, not surprisingly, the technology departments of many companies are unprepared to address these novel risks. To create impactful AI systems that add value to enterprises consistently and reliably, the leaders, managers, architects, and developers responsible for AI implementations should understand the capabilities and limitations of the current state of the art in generative AI.
The term “alignment” broadly refers to the extent to which the outputs generated by a large language model (LLM)-based enterprise AI system are consistent with the goals, values, and requirements of the enterprise. This involves ensuring that AI-generated content or decisions meet specific standards of accuracy, relevance, and ethical considerations, and that they adhere to the intended use cases and business objectives. Can such systems be built? Before I give an affirmative answer, let’s start with a simpler question: can LLMs think?
Can LLMs think?
Before we test any LLM’s ability to think, let’s quickly review the terms LLM, GenAI, AI, and ML. While they are often used interchangeably, each carries a unique subtle meaning.
Machine learning (ML). This is the most fundamental term, describing systems that learn from data and acquire behavior not explicitly spelled out in instructions such as programming code.
Artificial intelligence (AI). While the scientific community continues to prefer the term machine learning, marketing and sales departments have picked up AI as a more sensational equivalent. Cognitive science, philosophy, neuroscience, and psychology also tend to prefer the term AI.
Generative AI (GenAI). This term refers to the use of AI to create new content, such as text, images, music, audio, and videos. Generative AI is a sub-area of AI and, accordingly, of machine learning as well.
Large language model (LLM). This term refers to a more specific kind of AI model that can understand and generate language. Most currently available LLMs (e.g., ChatGPT, Claude, Gemini, Llama, and many more) are based on a decoder-only transformer architecture. These models are trained on large text corpora and fine-tuned using reinforcement learning from human feedback (RLHF) to be useful in interaction with humans.
Retrieval-augmented generation (RAG). A RAG-based AI system embeds one or more LLMs inside a processing workflow, giving the LLMs access to databases, calculators, the web, and other data that was not explicitly added to them during training.
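To make the RAG pattern concrete, here is a minimal sketch. The document store is a toy in-memory list, the retrieval is naive keyword overlap rather than embeddings, and `call_llm` is a hypothetical stand-in for whatever LLM API a real system would use:

```python
# Minimal RAG sketch: retrieve relevant snippets, then have the LLM answer
# grounded in them. `call_llm` is a hypothetical stand-in for an LLM API call.

DOCUMENTS = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday through Friday, 9am to 5pm CET.",
    "Enterprise plans include a dedicated account manager.",
]

def retrieve(question: str, docs: list[str], top_n: int = 2) -> list[str]:
    """Naive keyword-overlap retrieval; production systems use embeddings and a vector database."""
    q_words = set(question.lower().split())
    scored = sorted(docs, key=lambda d: -len(q_words & set(d.lower().split())))
    return scored[:top_n]

def answer(question: str) -> str:
    context = "\n".join(retrieve(question, DOCUMENTS))
    prompt = (
        "Answer using only the context below. If the answer is not there, say so.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)  # hypothetical LLM call
```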
To test the reasoning capabilities of LLMs, let’s challenge these foundational building blocks of any human language-capable AI system with a simple question I crafted: a variant of the classic river-crossing puzzle in which a farmer must ferry two sheep across a river in a boat large enough to carry the farmer and both sheep at once.
OpenAI’s most advanced ChatGPT 4o answer: (as of June 30, 2024)
ChatGPT 4o result: after four crossings, the farmer and both sheep end up back on the original side of the river (the original state). The model should have declared the problem solved right after the first crossing, but none of the similar problems available online that were used to train ChatGPT can be solved in one step, so it kept generating a solution and ultimately failed.
Anthropic’s most advanced Claude 3.5 Sonnet answer: (as of June 30, 2024)
Claude 3.5 Sonnet result: much better; at least the task is accomplished. The farmer and both sheep end up on the other side, but with two unnecessary trips in which the farmer crosses the river alone.
Google’s most advanced Gemini answer: (as of June 30, 2024)
Gemini produces the most elaborate plan: the farmer never takes advantage of the larger boat capacity, because the model assumes the boat can carry only one sheep at a time.
This problem illustrates the inability of even the most sophisticated LLMs to think. It starts out very similar to the classic puzzle in which “a farmer with a wolf, a goat, and a cabbage must cross a river by boat.” That puzzle is well covered online and was surely part of the pretraining data of every LLM in existence. While I seemingly simplified the problem, I also introduced an unexpected twist that took the LLM’s token generation process off the beaten path, and it started hallucinating. A simple explanation for this phenomenon is that the current transformer-based LLM architecture cannot reason in loops; it has to produce its final output in a single inference pass. While this works well for tasks such as summarizing, composing new content from existing content, and creating associative links between pieces of content, it is not reasoning.
This inability to think is best illustrated by a quote from Joscha Bach on Lex Fridman Podcast #392, “Life, Intelligence, Consciousness, AI & the Future of Humans”:
My intuition is that the language models that we are building are golems. They are machines that you give a task, and they’re going to execute the task until some condition is met and there’s nobody home. And the way in which nobody is home leads to that system doing things that are undesirable in a particular context. So, you have that thing talking to a child and maybe it says something that could be shocking and traumatic to the child. Or you have that thing writing a speech and it introduces errors in the speech that no human being would ever do if they’re responsible. The system doesn’t know who’s talking to whom. There is no ground truth that the system is embedded into. And of course we can create an external tool that is prompting our language model always into the same semblance of ground truth, but it’s not like the internal structure is causally produced by the needs of a being to survive in the universe, it is produced by imitating structure on the internet.
What LLMs lack in reasoning, they make up for in structured-to-unstructured and unstructured-to-structured data transformations, data labeling, content generation, and many other routine tasks that bring real value to enterprises. So why the sudden rush to regulate AI if LLMs are so bad at actual thinking and “there is nobody home”? AI systems have been around for more than a decade. They are used in production across multiple domains, from manufacturing to image processing and fault prediction, and there was no rush to regulate them beyond ensuring data privacy and security.
What changed?
The Key to Human Reaction to GenAI: Anthropomorphism
To answer this question, recall the Boston Dynamics video featured by CNN in which a warehouse humanoid robot is seemingly abused, and another Boston Dynamics video in which a dog-like Spot Classic is kicked by a human. The widespread outrage stemmed from anthropomorphism: humans have a tendency to attribute human-like qualities to non-human entities. When robots exhibit human-like or animal-like forms or behaviors, observers are more likely to empathize with them and project human emotions onto them. According to theory of mind, humans attribute mental states (beliefs, intents, desires, emotions, knowledge) to themselves and to others. When a robot appears to have a face, limbs, or other human-like features, people may unconsciously attribute mental states to it. On a biological level, mirror neurons fire both when an individual performs an action and when they observe someone else performing the same action. This mirroring mechanism can contribute to empathetic responses when we see human-like robots being "hurt."
Now comes an LLM. It “listens” like a human, and it “speaks” like a human. The Turing test is widely considered “passed.” While LLMs do not have a physical representation that resembles a human, their behavior—more specifically, how they communicate—clearly fires up our mirror neurons.
It is no surprise that we expect these machines, especially those deployed at enterprises, to be aligned with the human values expected of employees. When employees talk to other employees, and especially to customers, we expect them to conduct their conversations in a professional manner, to be respectful, to adhere to professional standards set by the company, and to follow corporate policies. When we interact with a rigid UI and it produces output, such as the result of a SQL query, no mirror neurons fire in our brains, even when the answer is wrong. We simply discount such output as the result of frequent “technical issues.” But when a conversational LLM-based agent gives us a misaligned answer, we visualize the human behind the chat interface and, consciously or subconsciously, demand the same level of alignment we demand from human employees.
Regulators reacted to this key difference between GenAI-based systems and previous generations of AI systems. Where regulators were previously concerned mostly with security and data privacy, GenAI systems now elicit emotions; in some cases they can provoke strongly negative human reactions and affect economic development.
The wave of regulations across the world, from the European Union to Singapore to Japan to the United States, attempts to force enterprises to introduce guardrails and to properly delineate responsibility between the original creators of AI systems, the enterprises that use or fine-tune them, and the end users of those enterprises.
While vector database company Pinecone has announced a groundbreaking “non-hallucinating” LLM, other companies seem unable to replicate that “success.” So we have to solve the alignment problem through different means. Before we outline general strategies, it is worth spending some time on why LLMs hallucinate, as this will help us design better solutions.
Understanding Hallucinations in Generative AI: A Feature, Not a Bug
Generative AI models, particularly LLMs, have demonstrated remarkable capabilities in producing coherent and contextually relevant text. However, they sometimes generate responses that are factually incorrect or entirely fabricated—a phenomenon known as "hallucinations." While often seen as a flaw, hallucinations can also be viewed as an inherent feature of the creative process within these models, analogous to how human communication can sometimes lead to unexpected and misleading statements.
At the core of how LLMs function is a process called token generation, which occurs during the model's sampling stage within the transformer architecture. This process can be likened to a journalist conducting an interview. Just as a journalist can frame questions or present information in a way that influences the interviewee's responses, LLMs select tokens (words or parts of words) in a sequence that influences subsequent tokens. When a journalist's line of questioning leads an interviewee to say something unexpected or out of character, it can result in surprising or misleading statements. Similarly, when an LLM picks an unusual token during its sampling process, the rest of the generated output follows from that choice, potentially leading to hallucinations or ungrounded answers.
The sampling process in LLMs involves a form of probabilistic decision-making. Each token is chosen based on a probability distribution over the model's vocabulary, where certain tokens are more likely to be selected than others given the context. However, there is always an element of randomness—akin to flipping a coin—when sampling occurs. This randomness, typically constrained by top-p and top-k sampling and controlled by the temperature parameter, allows for creativity and variability in the model's responses, enabling it to produce novel and engaging content. But it also means that sometimes the chosen token leads the model down a path that deviates from factual accuracy, resulting in hallucinations.
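To make these mechanics concrete, here is a small sketch of temperature plus top-k sampling using only NumPy; the logits are made up for illustration and stand in for a real model's output over its vocabulary:

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 0.8, top_k: int = 3) -> int:
    """Scale logits by temperature, keep only the top-k candidates,
    renormalize, and draw one token id at random from what remains."""
    scaled = logits / max(temperature, 1e-6)        # higher temperature flattens the distribution
    top_ids = np.argsort(scaled)[-top_k:]           # indices of the k most likely tokens
    probs = np.exp(scaled[top_ids] - scaled[top_ids].max())
    probs /= probs.sum()
    return int(np.random.choice(top_ids, p=probs))  # the "coin flip" that enables creativity

# Toy example: a made-up distribution over four candidate tokens.
logits = np.array([2.5, 2.3, 0.4, -1.0])
print(sample_next_token(logits))  # usually token 0 or 1, occasionally token 2
```

Once an unlikely token is drawn, every subsequent token is conditioned on it, which is how a single unlucky "coin flip" can cascade into a hallucinated answer.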
These hallucinations, while sometimes problematic, highlight the generative capabilities of LLMs. They showcase the model's ability to synthesize information in new ways, making connections that might not be immediately obvious or grounded in existing data. In this sense, hallucinations can be seen as a feature rather than a bug. They reflect the model's attempt to create meaningful and contextually rich text, even if it occasionally ventures into the realm of the implausible. This creative potential is a double-edged sword, offering both the promise of innovation and the challenge of ensuring reliability and accuracy in generated content.
While quite technical, the mechanics of token generation and the inherent randomness of sampling help explain what makes generative AI unique and why there is a sudden emphasis on alignment. Can we develop more robust systems?
Solutions to Alignment
While we have established that perfect alignment cannot be achieved with generative AI, we can still build enterprise-grade AI that delivers business value while minimizing risk.
Solutions to this problem can be broadly divided into five major categories:
Human in the loop.
Training or fine-tuning alignment into the LLM itself.
Post-control of the output to ensure alignment.
Pairing the generative system with a logical unit, mirroring the two hemispheres of the human brain.
Non-NLP generative systems with a machine feedback loop.
Human in the Loop
One effective way to mitigate the risks associated with hallucinations in generative AI is to involve human oversight in the decision-making process. This approach, known as "human in the loop" (HITL), ensures that human judgment can correct or validate the AI's outputs. In an HITL system, AI-generated content is reviewed by human experts before it is finalized. This not only helps catch any inaccuracies or hallucinations but also allows for contextual adjustments based on the specific needs of the business. HITL is particularly valuable in high-stakes environments like healthcare, finance, and legal services, where errors can have significant consequences. By integrating human expertise, enterprises can leverage the efficiency of AI while maintaining a high standard of accuracy and reliability.
The downside of HITL systems is that they are expensive and do not scale. To make the most of the HITL approach, consider risk-based triaging, where the system involves a human only for higher-risk decisions and proceeds without a human in the loop for low-risk or well-understood actions (similar to an airplane autopilot). HITL systems are widely viewed as an intermediate stage before moving to systems where humans merely oversee the ongoing operation of the AI and intervene only when absolutely necessary.
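A minimal sketch of risk-based triaging, assuming a hypothetical `risk_score` model and `request_human_review` queue; the threshold is an assumption that would be tuned to the business and its risk appetite:

```python
REVIEW_THRESHOLD = 0.7  # assumed cut-off; tune per use case and risk appetite

def handle_output(draft: str, context: dict) -> str:
    """Route the AI draft: auto-send low-risk answers, escalate the rest to a human."""
    score = risk_score(draft, context)        # hypothetical risk model returning 0.0 - 1.0
    if score >= REVIEW_THRESHOLD:
        return request_human_review(draft)    # hypothetical human review queue
    return draft                              # low-risk: proceed without a human in the loop
```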
Training or Fine-Tuning Alignment into the LLM
Another approach to achieving better alignment is through training or fine-tuning the language model itself. This involves adjusting the model's parameters and training data to better reflect the desired outputs and minimize the likelihood of hallucinations. Techniques such as reinforcement learning from human feedback (RLHF) can be employed, where the model is trained on curated datasets that emphasize accurate, aligned responses. Additionally, incorporating domain-specific data during the fine-tuning process can make the model more reliable in specialized contexts. For instance, a legal language model can be fine-tuned with legal documents and guidelines to ensure it generates more precise and relevant legal advice. By aligning the training process with specific business objectives, enterprises can create more reliable and contextually appropriate AI systems.
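As a rough illustration, a supervised fine-tuning pass over domain-specific examples might look like the sketch below, using the Hugging Face Trainer API. The base model is a small placeholder, "legal_qa.jsonl" is a hypothetical file with a "text" field, and this shows plain supervised fine-tuning only; RLHF itself is considerably more involved:

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

model_name = "gpt2"  # placeholder; a real project would start from a stronger base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# "legal_qa.jsonl" is a hypothetical file of domain-specific prompt/answer text.
dataset = load_dataset("json", data_files="legal_qa.jsonl", split="train")
tokenized = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
                        remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-legal", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```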
The downside of fine-tuning for better alignment is that, according to some researchers, the models may become less useful, losing creativity or output quality, and the performance of a RAG system built on an alignment-fine-tuned model depends largely on the knowledge domain, the fine-tuning methodology, and the training data set. This highlights the importance of a customized approach to fine-tuning and of high-quality data sets.
Post-Control of the Output
Post-control mechanisms involve analyzing and filtering the AI's output after it has been generated to ensure it meets alignment criteria. Automated post-control methods might include rule-based systems that flag or modify outputs that do not conform to predefined standards, or additional AI models designed to detect and correct hallucinations. The assumption here is that both models, generative and verifying, cannot hallucinate in the same way at the same time, even when both have exactly the same weights. This is similar to the defense-in-depth approach in cybersecurity, where creating an onion-like layered defense even from mediocre protective layers dramatically improves the outcome. Effectively, we are using randomness in the sampling to ensure that hallucinations are unlikely to be similar—thereby fixing the problem that this randomness creates in the first place!
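A minimal sketch of this generate-then-verify pattern, assuming hypothetical `generate` and `verify` helpers wrapping two LLM calls; the verdict format is an assumption:

```python
MAX_ATTEMPTS = 3  # assumed retry budget

def generate_verified_answer(question: str) -> str:
    """Ask the generator, then have an independent verification pass check the
    draft against the question (and retrieved sources, in a RAG setup)."""
    for _ in range(MAX_ATTEMPTS):
        draft = generate(question)                         # hypothetical generator LLM call
        verdict = verify(question=question, draft=draft)   # hypothetical verifier LLM call
        if verdict.get("grounded", False):
            return draft                                   # verifier found no hallucination
    return "I could not produce a sufficiently grounded answer."  # fail closed
```

Because each pass samples independently, the generator and the verifier are unlikely to hallucinate in the same way at the same time, which is exactly the assumption this approach relies on.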
The downside of this approach is the added latency and cost of inference. The “verifier” system must be invoked on top of the output of the first system, which is often RAG-enabled, and adds to the overall latency. This approach works best in AI systems that execute tasks without an expectation of an immediate, fluently conversational response. However, hardware progress may help solve this problem, with purpose-built inference systems such as Groq boasting dramatically faster inference speeds, potentially enabling one or more “verifiers” of the output. The advantage of this approach is that it does not diminish the creativity of the generative system, maximizing the business impact.
Pairing the Generative System with a Logical Unit
Inspired by the dual-hemisphere structure of the human brain, this approach pairs the generative AI system with a logical unit. The generative system, akin to the creative right hemisphere, produces the initial output, while the logical unit, resembling the analytical left hemisphere, evaluates the coherence and factual accuracy of that output and can draw logical conclusions that were not present in the training data of the generative model. This dual-system approach would balance the creativity of the generative model with rigorous logical scrutiny. It would also easily produce the right answer to the problem we started with, giving the only correct answer, “One trip is required,” perhaps with a comment that the system feels insulted by such a trivial question and by the waste of electricity. This method promises a more holistic and reliable AI solution that mirrors human cognitive processes and is widely considered a path to AGI (artificial general intelligence).
Science is still searching for an architecture that perfectly fuses generative capabilities with the power of formal reasoning and deduction. Until we get to AGI, retrieval-augmented generation (RAG), LangGraph-enabled workflows, calculators external to the LLM, and sandboxed environments that run generated Python code all infuse elements of logic and make certain aspects of GenAI-powered systems more logical and capable.
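One pragmatic version of this pairing is to let the LLM write a small program and have an interpreter, not the LLM, do the arithmetic or logic. Below is a sketch, assuming a hypothetical `call_llm` helper and using a plain subprocess as a stand-in for a real sandbox, which would add far stricter isolation:

```python
import subprocess
import sys
import textwrap

def solve_with_code(problem: str) -> str:
    """Ask the model for code, execute it in a separate process, and return
    whatever the code prints as the final, deterministically computed answer."""
    code = call_llm(  # hypothetical LLM call; a real system would also validate the generated code
        "Write a short Python script that solves the problem below and prints only the answer.\n"
        + problem
    )
    result = subprocess.run(
        [sys.executable, "-c", textwrap.dedent(code)],
        capture_output=True, text=True, timeout=10,  # a real sandbox enforces much stronger limits
    )
    return result.stdout.strip()
```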
Non-NLP Generative System and Machine Feedback Loop
Not all GenAI systems output human-readable text. The output can be programming language code, configuration-as-code files, a network configuration, or even a novel molecule. Imagine a system that can execute the output (such as code) or simulate the changes (such as changes to a network configuration, using a carefully built digital twin of the physical infrastructure). When that is paired with a “reward” system that helps compare one generated solution against another, we can fully utilize the generative capabilities of the LLM architecture. Instead of trying to get one output, we would instruct the GenAI-powered system to generate potentially thousands of outputs, quickly eliminating the ones that fail in the subsequent simulation or execution step and ranking the rest by the metric we want to optimize (electricity consumption, network latency, cloud costs, the effect of a novel molecule on target proteins, and so on). Such systems can produce novel architectures, novel drugs, and novel materials with no humans in the loop.
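A sketch of this generate-simulate-rank loop, with `propose_candidate`, `simulate`, and `cost` as hypothetical hooks into the generator, the digital twin, and the reward metric:

```python
def optimize(spec: str, n_candidates: int = 1000) -> list:
    """Generate many candidate solutions, drop the ones the simulator rejects,
    and rank the survivors by the metric we want to minimize."""
    survivors = []
    for _ in range(n_candidates):
        candidate = propose_candidate(spec)   # hypothetical GenAI proposal (code, config, molecule, ...)
        outcome = simulate(candidate)         # hypothetical digital-twin or execution step
        if outcome.get("valid", False):
            survivors.append((cost(outcome), candidate))  # hypothetical metric: latency, energy, cloud cost, ...
    return [c for _, c in sorted(survivors, key=lambda pair: pair[0])]
```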
Conclusion
The recent regulatory push across continents and the recent explosion of generative AI are not a coincidence. Most of the regulators’ requirements center on the responsibility of the companies that use these systems toward their stakeholders: customers, employees, countries, and their populations at large. The key difference between generative AI and the rest of AI/ML systems is that GenAI output is very human-like, and we tend to demand more from systems that can easily be mistaken for actual humans. Yet the transformer architecture inside these GenAI-based systems cannot be forced to follow specific rules in a predictable way.
While perfect alignment in generative AI remains elusive, the strategies explored in this article offer practical ways for enterprises to maximize the value of AI while minimizing risks and satisfying current and future regulatory requirements. By incorporating human oversight, refining training processes, implementing post-control measures, pairing generative systems with logical units, or closing the loop with machine feedback, businesses can develop AI systems that are both innovative and aligned with their operational goals.
This is a crucial question that many organizations face today. Aligning AI systems with enterprise values can drive meaningful impact. What strategies have you found effective in achieving this alignment?