Since its launch, ChatGPT has drawn countless people to explore it. But how does ChatGPT actually work? Although the details of its internal implementation have not been published, we can piece together its basic principles from recent research.
ChatGPT is the latest language model released by OpenAI, and it improves significantly on its predecessor GPT-3. Like many large language models, ChatGPT can generate text in different styles and for different purposes, but with markedly better accuracy, narrative detail, and contextual coherence. It represents the latest generation of OpenAI's large language models, and it was designed with a strong emphasis on interactivity.
OpenAI tuned ChatGPT with a combination of supervised learning and reinforcement learning, and it is the reinforcement learning component that makes ChatGPT unique. OpenAI used a training method called "reinforcement learning from human feedback" (RLHF), which incorporates human feedback into training to minimize unhelpful, untruthful, or biased output.
The following sections analyze the limitations of GPT-3 and how they arise from its training process, explain how RLHF works and how ChatGPT uses it to overcome GPT-3's problems, and finally discuss the limitations of this method.
Capability and alignment in large language models
"Consistency vs ability" can be considered as a more abstract analogy of "accuracy vs accuracy".
In machine learning, a model's capability refers to its ability to perform a specific task or set of tasks. Capability is usually evaluated by how well the model can optimize its objective function. For example, a model built to predict stock market prices might have an objective function that measures the accuracy of its predictions. If the model can accurately predict how prices change over time, it is considered to have high capability.
Alignment is concerned with what we actually want the model to do, not what it was trained to do. It raises the question "does the objective function match our intentions?", based on the extent to which the model's objectives and behavior match human expectations. Suppose you want to train a bird classifier that labels birds as "sparrows" or "robins", using log loss as the training objective, while the ultimate goal is high classification accuracy. The model may achieve low log loss, i.e. high capability, yet poor accuracy on the test set. This is an example of misalignment: the model can optimize the training objective, yet that objective is at odds with the final goal.
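To make this concrete, here is a toy numeric sketch (the labels and predicted probabilities are contrived for illustration): one model beats another on the training objective (log loss) while doing worse on the goal we actually care about (accuracy).

```python
import numpy as np

def log_loss(y, p):
    # Binary cross-entropy: the objective the classifier is trained on.
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def accuracy(y, p):
    # The final goal we actually care about.
    return np.mean((p > 0.5) == y)

y = np.array([1, 1, 1, 0])                 # 1 = "robin", 0 = "sparrow"
p_a = np.array([0.45, 0.45, 0.45, 0.55])   # hedged, but always on the wrong side of 0.5
p_b = np.array([0.99, 0.99, 0.01, 0.99])   # confident, and half right

print(f"model A: log loss {log_loss(y, p_a):.2f}, accuracy {accuracy(y, p_a):.0%}")
print(f"model B: log loss {log_loss(y, p_b):.2f}, accuracy {accuracy(y, p_b):.0%}")
# Model A wins on the training objective (0.80 vs 2.31) yet loses on the
# actual goal (0% vs 50% accuracy): capability without alignment.
```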
The original GPT-3 is a misaligned model. Large language models like GPT-3 are trained on vast amounts of text data from the Internet and can generate human-like text, but they do not always produce output that matches human expectations. In fact, their objective function is a probability distribution over word sequences, used to predict the next word in a sequence.
In practical applications, however, these models are meant to perform some form of valuable cognitive work, and there is a clear gap between how these models are trained and how they are expected to be used. Although computing the statistical distribution of word sequences may, mathematically speaking, be an efficient choice for modeling language, humans generate language by selecting the text sequences best suited to a given situation, drawing on background knowledge and common sense to do so. This gap becomes a problem when language models are used in applications that demand high trust or reliability, such as dialogue systems or intelligent personal assistants.
Although these large models trained on massive data have become extremely powerful over the past few years, they often fall short of their potential when used in practice to make people's lives easier. Alignment problems in large language models usually show up as:
Lack of helpfulness: failing to follow the user's explicit instructions.
Hallucinations: making up nonexistent or false facts.
Lack of interpretability: it is hard for humans to understand how the model arrived at a particular decision or prediction.
Generating biased or harmful content: a language model trained on biased and harmful data may reproduce such content in its output, even when not explicitly instructed to do so.
But where, specifically, does the alignment problem come from? Is the way language models are trained itself prone to misalignment?
How do language model training strategies produce misalignment?
Next-token prediction and masked language modeling are the core techniques used to train language models. In the first approach, the model is given a sequence of words as input and asked to predict the next word in the sequence. For example, given the input sentence:
“The cat sat on the”
it may predict the next word to be "mat", "chair", or "floor", because these words have a high probability given the preceding context; the language model can in fact estimate the likelihood of every possible word given the preceding sequence.
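As a minimal sketch of what this looks like in practice, the snippet below uses the open GPT-2 model from the Hugging Face transformers library (a small stand-in, not the model discussed here) to inspect the probability distribution over the next token:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 is used here only because it is small and openly available.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The cat sat on the", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # shape: (1, seq_len, vocab_size)

# Softmax over the last position gives P(next word | "The cat sat on the").
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id)!r}: {prob:.3f}")
```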
Masked language modeling is a variant of next-token prediction in which some words in the input sentence are replaced with a special token such as [MASK]. The model is then asked to predict the correct word to insert at the masked position. For example, given the sentence:
“The [MASK] sat on the mat”
it may predict that the masked position should be filled with "cat" or "dog".
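A quick way to see masked language modeling in action is the fill-mask pipeline from transformers, shown below with bert-base-uncased as a stand-in model:

```python
from transformers import pipeline

# bert-base-uncased is a stand-in for any masked language model.
fill = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill("The [MASK] sat on the mat."):
    print(f"{candidate['token_str']!r}: {candidate['score']:.3f}")
```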
One advantage of these objective functions is that they allow the model to learn the statistical structure of language, such as common word sequences and patterns of word usage. This generally helps the model produce more natural and fluent text, and it is an essential step in the pre-training stage of every language model.
However, these objective functions can also cause problems, chiefly because the model cannot distinguish between important and unimportant errors. A very simple example: if you give the model the sentence:
"The Roman Empire [MASK] with the reign of Augustus."
it may predict that the masked position should be filled with "began" or "ended", since both words have a high probability of occurring, even though the two choices imply completely different meanings; the training objective treats the factual error and the harmless alternative in much the same way.
Generally speaking, these training strategies can lead to misalignment on more complex tasks, because a model trained only to predict the next word in a text sequence does not necessarily learn higher-level representations of meaning. It is therefore difficult for the model to generalize to tasks that require a deeper understanding of language.
Researchers are studying various methods to solve the alignment problem in large language models. ChatGPT is based on the original GPT-3 model, but it was further trained with human feedback guiding the learning process, in order to address the model's misalignment. The specific technique used is the RLHF described earlier, and ChatGPT is the first model to put this technique into production.
How does ChatGPT use human feedback to solve the alignment problem?
Reinforcement learning from human feedback
The method generally includes three different steps:
Supervised fine-tuning: a pre-trained language model is fine-tuned on a small amount of labeled data to learn a supervised policy (the SFT model) that generates outputs from a given list of prompts;
Mimicking human preferences: labelers vote on a relatively large number of SFT model outputs, which yields a new dataset of comparison data. A new model, called the reward model (RM), is trained on this dataset;
Proximal policy optimization (PPO): the RM is used to further fine-tune and improve the SFT model. The output of this step is the so-called policy model.
Step 1 is performed only once, while steps 2 and 3 can be repeated continuously: more comparison data is collected on the current best policy model to train a new RM, and then a new policy is trained. Each step is described in detail below.
Step 1: Supervised fine-tuning of the model
The first step is to collect data in order to train the supervised policy model.
Data collection: a list of prompts is selected, and labelers write down the expected output for each. For ChatGPT, two different sources of prompts were used: some were prepared directly by labelers or researchers, and some were taken from requests to OpenAI's API (i.e. from GPT-3 users). Although the whole process is slow and expensive, the result is a relatively small, high-quality dataset (roughly 12-15k data points) that can be used to fine-tune the pre-trained language model.
Model selection: ChatGPT's developers chose a pre-trained model from the GPT-3.5 series rather than fine-tuning the original GPT-3. The baseline model used is the latest text-davinci-003, a GPT-3 model that was fine-tuned mostly on programming code.
To create a general-purpose chatbot like ChatGPT, the developers thus fine-tuned on top of a "code model" rather than a plain-text model.
Because the amount of data in this step is limited, the SFT model obtained here is likely to output text that still fails to attend to what users actually care about, and it generally suffers from misalignment. The problem is that the supervised learning step has high scaling costs.
To overcome this problem, instead of having human labelers create a much larger curated dataset, the strategy is to have them rank the different outputs of the SFT model in order to create the RM.
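OpenAI has not published its training code, but a minimal sketch of what this supervised step could look like is shown below. GPT-2 and the single demonstration pair are stand-ins, and the hyperparameters are illustrative:

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical (prompt, labeler-written response) demonstration pairs.
demos = [("Explain gravity to a child.",
          "Gravity is the invisible pull that makes things fall down ...")]

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in for GPT-3.5
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def collate(batch):
    texts = [p + "\n" + r + tokenizer.eos_token for p, r in batch]
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    labels = enc["input_ids"].clone()
    labels[enc["attention_mask"] == 0] = -100   # no loss on padding tokens
    enc["labels"] = labels
    return enc

for batch in DataLoader(demos, batch_size=1, collate_fn=collate):
    loss = model(**batch).loss     # standard next-token cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```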
Step 2: Training the reward model
The goal of this step is to learn an objective function directly from the data. The purpose of this function is to score SFT model outputs in proportion to how desirable those outputs are to humans. It strongly reflects the specific preferences of the selected human labelers and the common guidelines they agreed to follow. In the end, this process yields a system that mimics human preferences learned from the data.
It works as follows:
A list of prompts is selected, and the SFT model generates multiple outputs (anywhere between 4 and 9) for each prompt;
Labelers rank the outputs from best to worst. The result is a new labeled dataset roughly 10 times the size of the curated dataset used for the SFT model;
This new data is used to train the RM. The RM takes SFT model outputs as input and scores them in order of preference.
For labelers, ranking outputs is far easier than writing them from scratch, so this process scales much more efficiently. In practice, the dataset was built from roughly 30-40k prompts, with different combinations of ranked outputs included for each.
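A common way to train such a reward model, and the form used in the InstructGPT paper that ChatGPT builds on, is a pairwise ranking loss over the labelers' comparisons. A minimal PyTorch sketch, with toy scores:

```python
import torch
import torch.nn.functional as F

def rm_pairwise_loss(reward_chosen: torch.Tensor,
                     reward_rejected: torch.Tensor) -> torch.Tensor:
    """Push the RM's score for the preferred output above the other's.

    reward_chosen / reward_rejected are the scalar scores the RM assigns
    to the higher- and lower-ranked output for the same prompt.
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# A labeler ranking of K outputs yields K*(K-1)/2 such pairs per prompt.
# Toy scores for three comparison pairs:
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.1, -0.5, 1.9])
print(rm_pairwise_loss(chosen, rejected))   # shrinks as chosen pulls ahead
```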
Step 3: Fine-tuning the SFT model with PPO
In this step, reinforcement learning is applied to optimize the SFT model against the RM. The specific algorithm used is called proximal policy optimization (PPO), and the fine-tuned model is referred to as the PPO model.
What is PPO? The main features of the algorithm are as follows:
PPO is an algorithm for training agents in reinforcement learning. It is called an "on-policy" algorithm because it learns from and updates the current policy directly, rather than learning from past experience the way "off-policy" algorithms like DQN do. PPO continuously adjusts the policy according to the actions the agent takes and the rewards it receives;
PPO trains the policy using a "trust region optimization" approach, which constrains how far the policy can move from the previous policy in order to ensure stability. This is in sharp contrast to vanilla policy gradient methods, which sometimes make large updates to the policy and destabilize it;
PPO uses a value function to estimate the expected return of a given state or action. The value function is used to compute the advantage function, which measures how much better an action turned out than the return expected from that state. The advantage function is then used to update the policy, by comparing the actions taken by the current policy with those the previous policy would have taken. This lets PPO make smarter policy updates based on the estimated value of the actions taken; a minimal sketch of the clipped objective appears below.
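This is the standard PPO-clip formulation, not ChatGPT-specific code; the clip range eps is illustrative:

```python
import torch

def ppo_clip_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                  advantage: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective (returned as a loss to minimize).

    logp_new / logp_old: log-probabilities of the actions taken, under the
    current and the previous policy; advantage: estimated advantage values.
    """
    ratio = torch.exp(logp_new - logp_old)          # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
    # Taking the elementwise minimum keeps updates inside the trust region.
    return -torch.min(unclipped, clipped).mean()
```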
In this step, the PPO model is initialized from the SFT model, and the value function is initialized from the RM. The environment is a "bandit environment": it presents a random prompt and expects a response to that prompt. Given a prompt and a response, it produces a reward (determined by the RM). A per-token KL penalty against the SFT model is added to the reward, to avoid over-optimizing against the RM.
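A minimal sketch of how such a per-token KL-penalized reward could be computed; the coefficient beta and the reward shaping are assumptions for illustration:

```python
import torch

def penalized_rewards(rm_score: float, logp_policy: torch.Tensor,
                      logp_sft: torch.Tensor, beta: float = 0.02) -> torch.Tensor:
    """Per-token rewards for one sampled response (beta is illustrative).

    rm_score: scalar score from the RM for the full response;
    logp_policy / logp_sft: per-token log-probs of the sampled tokens under
    the current policy and the frozen SFT model.
    """
    kl_per_token = logp_policy - logp_sft       # sample-based KL estimate
    rewards = -beta * kl_per_token              # penalize drifting from SFT
    rewards[-1] += rm_score                     # RM score credited at the end
    return rewards
```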
Performance evaluation
Because the model is trained on human-labeled input, the core part of evaluation is also based on human input: labelers score the quality of the model's outputs. To avoid overfitting to the judgment of the labelers involved in the training stage, the test set uses prompts from other OpenAI customers that do not appear in the training data.
The model is evaluated based on three criteria:
Helpfulness: judging the model's ability to follow user instructions, and to infer instructions.
Truthfulness: judging the model's tendency to hallucinate facts on closed-domain tasks.
Harmlessness: labelers assess whether the model's output is appropriate and whether it contains discriminatory content.
The model was also evaluated on zero-shot performance on traditional NLP tasks (such as question answering, reading comprehension, and summarization). The developers found that on some of these tasks the model performs worse than GPT-3. This is an example of an "alignment tax": the RLHF-based alignment procedure comes at the cost of lower performance on certain tasks.
The performance regression on these datasets can be greatly reduced by a trick called pretraining mix: during PPO training by gradient descent, the gradient updates are computed by mixing in gradients from the original pretraining data alongside the PPO gradients.
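In code, this mixing amounts to adding a weighted pretraining language-modeling term to the RL objective. A trivial sketch; the coefficient gamma is illustrative:

```python
import torch

def mixed_loss(ppo_loss: torch.Tensor, pretrain_lm_loss: torch.Tensor,
               gamma: float = 1.0) -> torch.Tensor:
    # Blend the RL objective with the original language-modeling objective;
    # gamma (illustrative) sets how strongly pretraining gradients are mixed
    # back in to limit regressions on standard NLP benchmarks.
    return ppo_loss + gamma * pretrain_lm_loss
```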
Disadvantages of the method
A very obvious limitation of this approach is that, in the process of aligning the language model with human intentions, the data used to fine-tune the model is shaped by a variety of complex subjective factors, mainly including:
The preferences of the labelers who produce the demonstration data;
The researchers who design the study and write the labeling instructions;
The choice of prompts, either crafted by developers or provided by OpenAI customers;
Labeler bias, which enters both RM training and model evaluation.
The authors of ChatGPT also acknowledge the obvious fact that the labelers and researchers involved in the training process may not fully represent all potential end users of the language model.
In addition to this obvious "endogenous" limitation, the method has several other shortcomings and open problems:
Lack of a control study: the reported results measure the performance of the final PPO model against the SFT baseline. This can be misleading: how do we know the improvements are actually due to RLHF? A proper control study is therefore essential, one that invests the same number of labeler hours used to train the RM into creating a larger, curated supervised dataset of high-quality data. That would make it possible to objectively measure how much RLHF improves over the supervised approach. In short, the absence of such a study leaves a fundamental question wide open: does RLHF actually do a good job of aligning language models?
Comparison data lacks ground truth: labelers often disagree on the ranking of model outputs. Technically, the risk is that a large variance is added to the comparison data without any ground truth.
Human preferences are not homogeneous: the RLHF method treats human preferences as homogeneous and static. Assuming that all people share the same values is plainly inaccurate; although there is much common ground, on many matters human perceptions differ widely.
Prompt stability of the RM is untested: no experiments show how sensitive the RM is to changes in the input prompt. If two prompts are syntactically different but semantically equivalent, can the RM produce significantly different rankings of the model outputs? How much does prompt quality matter to the RM?
Other problems: with RL methods, the model can sometimes learn to game its own RM in order to achieve the desired outcome, leading to "over-optimization of the policy". This can cause the model to recreate certain patterns that, for some unknown reason, make the RM score them higher. ChatGPT patches this with the KL penalty term in the reward function.