Since its release, the public has been playing with ChatGPT and seeing what it can do, but how does ChatGPT actually work? While the details of its inner workings have not been published, we can piece together its functioning principles from recent research.
ChatGPT is the latest language model from OpenAI and represents a significant improvement over its predecessor GPT-3. Similarly to many Large Language Models, ChatGPT is capable of generating text in a wide range of styles and for different purposes, but with remarkably greater precision, detail, and coherence. It represents the next generation in OpenAI’s line of Large Language Models, and it is designed with a strong focus on interactive conversations.
The creators have used a combination of both Supervised Learning and Reinforcement Learning to fine-tune ChatGPT, but it is the Reinforcement Learning component specifically that makes ChatGPT unique. The creators use a particular technique called Reinforcement Learning from Human Feedback (RLHF), which uses human feedback in the training loop to minimize harmful, untruthful, and/or biased outputs.
We are going to examine GPT-3’s limitations and how they stem from its training process, before learning how RLHF works and understanding how ChatGPT uses RLHF to overcome these issues. We will conclude by looking at some of the limitations of this methodology.
In the context of machine learning, the term capability refers to a model’s ability to perform a specific task or set of tasks. A model’s capability is typically evaluated by how well it is able to optimize its objective function, the mathematical expression that defines the goal of the model. For example, a model designed to predict stock market prices might have an objective function that measures the accuracy of the model’s predictions. If the model is able to accurately predict the movement of stock prices over time, it would be considered to have a high level of capability for this task.
Alignment, on the other hand, is concerned with what we actually want the model to do versus what it is being trained to do. It asks the question “is that objective function consistent with our intentions?” and refers to the extent to which a model’s goals and behavior align with human values and expectations. For a simple concrete example, say we train a bird classifier to classify birds as either “sparrows” or “robins” and we use log loss (which measures the difference between the predicted probability distribution of the model and the true distribution) as the training objective, even though our ultimate goal is a high classification accuracy. The model might have low log loss, i.e. the model’s capability is high, but poor accuracy on the test set. In fact, the log loss is not perfectly correlated with accuracy in classification tasks. This is an example of misalignment, where the model is capable of optimizing the training objective but poorly aligned with our ultimate goal.
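To make the sparrow/robin example concrete, here is a minimal sketch with made-up numbers (the labels and predicted probabilities are purely illustrative, not taken from any real classifier). It shows that a model can achieve a lower log loss than another while being less accurate.

```python
import math

def log_loss(y_true, p_robin):
    """Mean binary cross-entropy; y_true is 1 for "robin", 0 for "sparrow"."""
    total = 0.0
    for y, p in zip(y_true, p_robin):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def accuracy(y_true, p_robin, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    preds = [1 if p >= threshold else 0 for p in p_robin]
    return sum(int(a == b) for a, b in zip(preds, y_true)) / len(y_true)

# Hypothetical test labels: three robins followed by three sparrows.
y_true = [1, 1, 1, 0, 0, 0]

# Model A: very confident on four examples, narrowly wrong on two.
model_a = [0.99, 0.99, 0.49, 0.01, 0.01, 0.51]
# Model B: only barely on the right side of the threshold, but every time.
model_b = [0.55, 0.55, 0.55, 0.45, 0.45, 0.45]

for name, probs in [("A", model_a), ("B", model_b)]:
    print(name, round(log_loss(y_true, probs), 3), round(accuracy(y_true, probs), 3))
```

Model A wins on the training objective (log loss) while Model B wins on the metric we actually care about (accuracy), which is the kind of gap the term misalignment refers to here.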
Large Language Models, such as GPT-3, are trained on vast amounts of text data from the internet and are capable of generating human-like text, but they may not always produce output that is consistent with human expectations or desirable values. In fact, their training objective is defined in terms of a probability distribution over word sequences (or token sequences): the model learns to predict what the next word is in a sequence (more details on this below).
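As a rough illustration of what "predicting the next word" means as an objective, the sketch below scores a token sequence by summing the negative log-probability of each token given the model's scores at that step. The function names and the tiny four-token vocabulary are assumptions made for this example; a real language model would produce the per-step scores (logits) with a neural network.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sequence_nll(step_logits, target_ids):
    """Sum of -log p(token_t | previous tokens) over the whole sequence."""
    nll = 0.0
    for logits, target in zip(step_logits, target_ids):
        nll -= math.log(softmax(logits)[target])
    return nll

# Hypothetical scores over a 4-token vocabulary at three positions.
step_logits = [
    [2.0, 0.1, 0.1, 0.1],
    [0.1, 2.0, 0.1, 0.1],
    [0.1, 0.1, 2.0, 0.1],
]
print(sequence_nll(step_logits, [0, 1, 2]))  # targets the model favours: low loss
print(sequence_nll(step_logits, [3, 3, 3]))  # targets the model disfavours: higher loss
```

Nothing in this objective asks whether the continuation is helpful or true; it only rewards assigning high probability to whatever text appeared in the training data.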
In practical applications, however, these models are intended to perform some form of valuable cognitive work, and there is a clear divergence between the way these models are trained and the way we would like to use them. Even though a machine-calculated statistical distribution of word sequences might be, mathematically speaking, a very effective choice to model language, we as humans generate language by choosing text sequences that are best for the given situation, using our background knowledge and common sense to guide this process. This can be a problem when language models are used in applications that require a high degree of trust or reliability, such as dialogue systems or intelligent personal assistants.
While these powerful, complex models trained on huge amounts of data have become extremely capable in the last few years, when used in production systems to make human lives easier they often fall short of this potential. The alignment problem in Large Language Models typically manifests as:
- Lack of helpfulness: not following the user’s explicit instructions.
- Hallucinations: the model making up nonexistent or wrong facts.
- Lack of interpretability: it is difficult for humans to understand how the model arrived at a particular decision or prediction.
- Generating biased or toxic output: a language model that is trained on biased/toxic data may reproduce that in its output, even if it was not explicitly instructed to do so.
But where does this alignment problem stem from, concretely? Is the very way language models are trained inherently prone to misalignment?
Reinforcement Learning from Human Feedback
The method overall consists of three distinct steps:
- Supervised fine-tuning step: a pre-trained language model is fine-tuned on a relatively small amount of demonstration data curated by labelers, to learn a supervised policy (the SFT model) that generates outputs from a selected list of prompts. This represents the baseline model.
- “Mimic human preferences” step: labelers are asked to vote on a relatively large number of the SFT model outputs, this way creating a new dataset consisting of comparison data. A new model is trained on this dataset. This is referred to as the reward model (RM).
- Proximal Policy Optimization (PPO) step: the reward model is used to further fine-tune and improve the SFT model. The outcome of this step is the so-called policy model.
Step 1 takes place only once, while steps 2 and 3 can be iterated continuously: more comparison data is collected on the current best policy model, which is used to train a new reward model and then a new policy.
Let’s now dive into the details of each step!
Step 1: The Supervised Fine-Tuning (SFT) model
The first step consists in collecting demonstration data in order to train a supervised policy model, referred to as the SFT model.
- Data collection: a list of prompts is selected and a group of human labelers is asked to write down the expected output response. For ChatGPT, two different sources of prompts have been used: some were prepared directly by the labelers or developers, and some were sampled from OpenAI’s API requests (i.e. from their GPT-3 customers). As this whole process is slow and expensive, the result is a relatively small, high-quality curated dataset (of approximately 12-15k data points, presumably) that is used to fine-tune a pretrained language model.
- Choice of model: instead of fine-tuning the original GPT-3 model, the developers of ChatGPT opted for a pretrained model in the so-called GPT-3.5 series. Presumably the baseline model used is the latest one, text-davinci-003, a GPT-3 model which was fine-tuned mostly on programming code.
Quite interestingly, therefore, in order to create a general purpose chatbot like ChatGPT, the developers decided to fine-tune on top of a “code model” rather than a pure text model.
Due to the limited amount of data for this step, the SFT model obtained after this process is likely to output text which is still (probabilistically) not very user-attentive and generally suffers from misalignment, in the sense explained in the above sections. The problem here is that the supervised learning step scales poorly: every additional training example requires a human to write a full response.
To overcome this problem, instead of asking human labelers to create a much bigger curated dataset, a slow and costly process, the strategy is now to have the labelers rank different outputs of the SFT model to create a reward model. Let’s explain this in more detail in the following section.
Step 2: The reward model (RM)
The goal is to learn an objective function (the reward model) directly from the data. The purpose of this function is to give a score to the SFT model outputs, proportional to how desirable these outputs are for humans. In practice, this will strongly reflect the specific preferences of the selected group of human labelers and the common guidelines which they agreed to follow. In the end, this process will extract from the data an automatic system that is supposed to mimic human preferences.
Here’s how it works:
- A list of prompts is selected and the SFT model generates multiple outputs (anywhere between 4 and 9) for each prompt.
- Labelers rank the outputs from best to worst. The result is a new labeled dataset, where the rankings are the labels. The size of this dataset is approximately 10 times bigger than the curated dataset used for the SFT model.
- This new data is used to train a reward model (RM). This model takes as input a few of the SFT model outputs and ranks them in order of preference.
Because it is much easier for labelers to rank outputs than to produce them from scratch, this process scales up much more efficiently. In practice, this dataset has been generated from a selection of 30-40k prompts, and a variable number of the generated outputs (for each prompt) is presented to each labeler during the ranking phase.
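How can a ranking be turned into a training signal? The text above does not spell out the loss function, so the following is only a sketch of one common approach, stated here as an assumption: every ranking of K outputs is expanded into all "better vs. worse" pairs, and the reward model is penalized whenever it scores the worse output close to or above the better one. The function name and the example scores are hypothetical.

```python
import math
from itertools import combinations

def pairwise_ranking_loss(scores_best_to_worst):
    """Average -log(sigmoid(r_better - r_worse)) over all pairs in one ranking.

    scores_best_to_worst: reward-model scores for the outputs of a single
    prompt, listed in the order the labeler ranked them (best first).
    """
    # combinations() keeps list order, so the first item of each pair is the
    # one the labeler preferred.
    pairs = list(combinations(scores_best_to_worst, 2))
    total = 0.0
    for better, worse in pairs:
        margin = better - worse
        # Numerically stable form of -log(sigmoid(margin)).
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(pairs)

# Hypothetical reward-model scores for 4 outputs of one prompt.
agrees_with_labeler = [2.0, 1.0, 0.0, -1.0]     # best output scored highest
disagrees_with_labeler = [-1.0, 0.0, 1.0, 2.0]  # best output scored lowest

print(pairwise_ranking_loss(agrees_with_labeler))
print(pairwise_ranking_loss(disagrees_with_labeler))
```

A ranking of 4 outputs yields 6 comparison pairs and a ranking of 9 yields 36, which is part of why ranking is a more data-efficient use of labeler time than writing demonstrations.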
Step 3: Fine-tuning the SFT model via Proximal Policy Optimization (PPO)
Reinforcement Learning is now applied to fine-tune the SFT policy by letting it optimize the reward model. The specific algorithm used is called Proximal Policy Optimization (PPO) and the fine-tuned model is referred to as the PPO model.
What is PPO? Here are the main takeaways of this method:
- PPO is an algorithm that is used to train agents in reinforcement learning. It is called an “on-policy” algorithm because it learns from and updates the current policy directly, rather than learning from past experiences as in “off-policy” algorithms like DQN (Deep Q-Network). This means that PPO is continuously adapting the current policy based on the actions that the agent is taking and the rewards it is receiving.
- PPO follows a trust-region-style approach to train the policy, which means that it constrains the change in the policy to be within a certain distance of the previous policy in order to ensure stability. This is in contrast to other policy gradient methods which can sometimes make large updates to the policy that can destabilize learning.
- PPO uses a value function to estimate the expected return of a given state or action. The value function is used to compute the advantage function, which represents the difference between the return actually obtained and the expected return. The advantage is then used to update the policy, weighted by how much more (or less) likely the current policy is to take the sampled action than the previous policy was. This allows PPO to make more informed updates to the policy based on the estimated value of the actions being taken.
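The takeaways above can be sketched for a single sampled action. The snippet uses the clipped surrogate objective, the most widely used PPO variant; whether ChatGPT uses exactly this form, and the value of the clipping parameter (`clip_eps=0.2` here), are assumptions made for illustration.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped objective for one action (to be maximized).

    logp_new / logp_old: log-probability of the sampled action under the
    current policy and under the policy that generated the sample.
    advantage: estimate of how much better the action was than expected.
    """
    ratio = math.exp(logp_new - logp_old)
    # Keep the probability ratio inside [1 - eps, 1 + eps].
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum makes the objective a pessimistic bound, so the
    # policy gains nothing from moving far away from the old policy.
    return min(ratio * advantage, clipped_ratio * advantage)

# A good action (advantage > 0) that the new policy already makes 1.5x more
# likely: the objective is capped at 1.2 * advantage.
print(clipped_surrogate(math.log(1.5), 0.0, advantage=1.0))

# A bad action (advantage < 0) made 1.5x more likely: no clipping relief,
# the full penalty applies.
print(clipped_surrogate(math.log(1.5), 0.0, advantage=-1.0))
```

The clipping is what keeps each update close to the previous policy: once the ratio leaves the allowed interval in the direction that would increase the objective, the gradient for that sample vanishes.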
In this step, the PPO model is initialized from the SFT model, and the value function is initialized from the reward model. The environment is a bandit environment which presents a random prompt and expects a response to the prompt. Given the prompt and response, it produces a reward (determined by the reward model) and the episode ends. A per-token KL penalty relative to the SFT model is added at each token to mitigate over-optimization of the reward model.
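One way to picture how the reward model score and the KL penalty combine is shown below. This is a sketch under stated assumptions: the penalty at each token is estimated from the log-probability gap between the PPO policy and the SFT model on the sampled token, the reward model score is credited at the final token, and the coefficient `beta` and all numbers are hypothetical.

```python
def kl_shaped_rewards(rm_score, logp_policy, logp_sft, beta=0.02):
    """Per-token rewards for one (non-empty) response.

    rm_score: scalar score from the reward model for the full response.
    logp_policy / logp_sft: log-probabilities of each sampled token under
    the current PPO policy and under the frozen SFT model.
    """
    # Penalize tokens the policy makes more likely than the SFT model did.
    rewards = [-beta * (lp - ls) for lp, ls in zip(logp_policy, logp_sft)]
    # The episode ends after the response, so the reward model's score is
    # added once, at the last token.
    rewards[-1] += rm_score
    return rewards

# Hypothetical three-token response.
logp_policy = [-0.5, -1.0, -0.2]
logp_sft = [-1.5, -1.0, -2.2]
print(kl_shaped_rewards(rm_score=1.0, logp_policy=logp_policy, logp_sft=logp_sft, beta=0.1))
```

The further the policy drifts from the SFT model to chase a high reward model score, the more of that score is cancelled by the penalty, which discourages the policy from exploiting quirks of the reward model.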