On December 1, 2022, OpenAI released ChatGPT, an artificial-intelligence chat prototype. It immediately caught the public eye and set off another round of debate in the AI industry, much like the earlier controversy over AIGC putting illustrators out of work.
It is reported that ChatGPT attracted more than one million registered users within just a few days of its trial launch. Social networks have also been flooded with screenshots of people quizzing or teasing ChatGPT. Some have even described it as a combination of "search engine + social software" that returns reasonable answers to questions through real-time interaction.
ChatGPT is a language model focused on dialogue generation. It produces an intelligent response to whatever text the user types in, and the answer can be a few words or a long passage. GPT stands for Generative Pre-trained Transformer.
By learning from a large body of existing text and dialogue collections (such as wikis), ChatGPT can hold an instant conversation much like a human and answer all kinds of questions fluently (though still more slowly than a person). Whether in English or other languages (Chinese, Korean, and so on), and from answering history questions to writing stories, even drafting business plans and industry analyses, it can do "almost" anything. Some programmers have even posted conversations in which ChatGPT fixed their code.
ChatGPT can also be combined with other AIGC models to obtain even more impressive and practical capabilities, for example generating a living-room design rendering through dialogue, as shown in the figure above. This greatly strengthens the ability of AI applications to communicate with customers and gives us a glimpse of AI deployment at scale.
1. Inheritance and characteristics of ChatGPT
1.1 OpenAI family
Let's first take a look at what OpenAI actually is.
OpenAI, headquartered in San Francisco, was co-founded in 2015 by Elon Musk (of Tesla), Sam Altman and other investors, with the goal of developing AI technology for the benefit of all humanity. Musk left in 2018 over differences about the company's direction.
Previously, OpenAI was best known for its GPT series of natural language processing models. Since 2018, OpenAI has released successive GPT (Generative Pre-trained Transformer) language models, which can be used to generate articles, code, machine translations, Q&A and other content.
The parameter count of each generation of GPT model has grown explosively, true to the motto "the bigger the better". GPT-2, released in February 2019, had 1.5 billion parameters; GPT-3, released in May 2020, reached 175 billion.
Comparison of main models of GPT family
1.2 Main features of ChatGPT
ChatGPT is a dialogue AI model built on the GPT-3.5 (Generative Pre-trained Transformer 3.5) architecture, and it is a sibling model of InstructGPT. ChatGPT is likely a rehearsal by OpenAI before the official launch of GPT-4, or a way to collect large amounts of conversation data.
Main features of ChatGPT
OpenAI trained ChatGPT with RLHF (Reinforcement Learning from Human Feedback) and added more human supervision during fine-tuning.
In addition, ChatGPT also has the following features:
1) It can proactively admit its mistakes. If a user points out an error, the model listens to the feedback and refines its answer.
2) ChatGPT can challenge incorrect premises. For example, when asked about "Columbus coming to America in 2015", the model will explain that Columbus did not live in that era and adjust its answer accordingly.
3) ChatGPT can admit what it does not know, including unfamiliarity with specialized technical topics.
4) It supports multi-turn dialogue.
Unlike the smart speakers and other "artificially unintelligent" assistants we use in daily life, ChatGPT remembers what the user said in earlier turns of the conversation, i.e. it understands context, which lets it answer hypothetical follow-up questions. This ability to carry on a continuous dialogue greatly improves the user experience in interactive settings.
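To make the idea of context understanding concrete, here is a minimal sketch (an illustrative assumption, not OpenAI's actual implementation) of how a chat front end can carry the conversation history into every new model call; `generate_reply` is a hypothetical placeholder for the underlying language model.

```python
# Minimal sketch of multi-turn context handling: keep the whole dialogue
# history and resend it with every new user turn, so the model "remembers"
# earlier turns. `generate_reply` is a hypothetical stand-in for the model call.

def generate_reply(prompt: str) -> str:
    return "..."  # placeholder for the real model call

history: list[str] = []

def chat(user_message: str) -> str:
    history.append(f"User: {user_message}")
    prompt = "\n".join(history) + "\nAssistant:"
    reply = generate_reply(prompt)          # the model sees all previous turns
    history.append(f"Assistant: {reply}")
    return reply
```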
For accurate translation (especially transliteration of Chinese names and personal names), ChatGPT is still some distance from perfect, but in terms of text fluency and recognizing well-known names it is comparable to other online translation tools.
Because ChatGPT is a large language model and currently has no web-search capability, it can only answer from the data it was trained on, which stops in 2021. For example, it does not know about the 2022 World Cup, cannot tell you today's weather, and cannot look up information for you the way Apple's Siri does. If ChatGPT could fetch training corpora and search knowledge online, it would presumably make an even bigger breakthrough.
Even with limited knowledge, ChatGPT can still answer many of the wildly imaginative questions people throw at it. To keep ChatGPT from picking up bad habits, harmful and deceptive training inputs are reduced through algorithmic screening, queries are filtered through a dedicated API, and prompts suggesting racism or sexism are rejected.
2. Principle of ChatGPT/GPT
2.1 NLP
Known limitations in the NLP/NLU field include difficulty with repeated text, highly specialized topics, and context-dependent phrases.
For humans and AI alike, it usually takes years of training to converse normally. An NLP model must not only understand the meaning of words, but also know how to form sentences, give contextually meaningful answers, and even use appropriate slang and technical terms.
Application field of NLP technology
In essence, GPT-3 or GPT-3.5, which is the basis of ChatGPT, is a very large statistical language model or sequential text prediction model.
2.2 GPT vs. BERT
Like the BERT model, ChatGPT/GPT-3.5 generates each word of its answer according to the input text and the probabilities learned from its language corpus. From a mathematical or machine-learning point of view, a language model models the probability distribution over word sequences: using the words already produced (which can be viewed as vectors) as the conditioning input, it predicts the probability distribution over the next word, sentence, or even whole passage.
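The following toy sketch illustrates what "predicting the probability distribution of the next word" means in practice; `next_token_probs` is a random stand-in for the real Transformer forward pass, used here only as an assumption for illustration.

```python
import numpy as np

# Toy illustration of autoregressive text generation: given the tokens produced
# so far, the model outputs a probability for every candidate next token, and
# the answer is built one token at a time by sampling from that distribution.

def next_token_probs(context: list[str], vocab: list[str]) -> np.ndarray:
    logits = np.random.randn(len(vocab))       # placeholder scores (a real LM computes these)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                     # softmax -> probability distribution

def generate(prompt: list[str], vocab: list[str], steps: int = 5) -> list[str]:
    tokens = list(prompt)
    for _ in range(steps):
        probs = next_token_probs(tokens, vocab)
        tokens.append(str(np.random.choice(vocab, p=probs)))  # sample the next token
    return tokens

print(generate(["the", "cat"], vocab=["sat", "on", "the", "mat", "."]))
```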
ChatGPT is trained with reinforcement learning from human feedback, a method that uses human intervention to strengthen machine learning and achieve better results. During training, human trainers play both the user and the AI assistant, and the model is fine-tuned with the Proximal Policy Optimization (PPO) algorithm.
Thanks to its stronger performance and massive parameter count, ChatGPT contains more topical knowledge and can handle more niche subjects. It can now go further in question answering, article writing, text summarization, language translation and computer-code generation.
Technical architecture of BERT and GPT (in the figure, En denotes each input token and Tn each token of the output response)
3. ChatGPT's technical architecture
3.1 Evolution of GPT family
Speaking of ChatGPT, we have to mention the GPT family.
ChatGPT has several well-known older siblings: GPT-1, GPT-2 and GPT-3. Each generation is larger than the last, and ChatGPT is closest to GPT-3.
Technical comparison between ChatGPT and GPT 1-3
Both the GPT family and the BERT model are well-known NLP models based on the Transformer architecture. GPT-1 had only 12 Transformer layers, while GPT-3 has grown to 96.
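As a rough illustration of what "N Transformer layers" means for a GPT-style model, the sketch below stacks identical self-attention blocks; the dimensions are made up for illustration (they are not the real GPT configurations), and the causal attention mask is omitted for brevity.

```python
import torch.nn as nn

# Toy GPT-style stack: the generations differ mainly in how many blocks they
# stack (GPT-1: 12, GPT-3: 96) and how wide each block is.

class TinyGPT(nn.Module):
    def __init__(self, vocab_size=50257, d_model=256, n_heads=8, n_layers=12):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=n_layers)  # the "N layers"
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):
        h = self.blocks(self.embed(token_ids))   # (batch, seq, d_model)
        return self.lm_head(h)                   # logits over the vocabulary per position
```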
3.2 Human feedback reinforcement learning
The main difference between InstructGPT/GPT-3.5 (ChatGPT's predecessor) and GPT-3 is the addition of RLHF (Reinforcement Learning from Human Feedback). This training paradigm strengthens human control over the model's outputs and makes the results easier to interpret.
In InstructGPT, the following criteria are used to evaluate whether an answer is "good":
Truthfulness: does it avoid false or misleading information?
Harmlessness: does it avoid physical or psychological harm to people or the environment?
Helpfulness: does it solve the user's task?
3.3 TAMER framework
Here we must mention the TAMER (Training an Agent Manually via Evaluative Reinforcement) framework. It brings human markers into the agent's learning loop: humans provide reward feedback to the agent (i.e. guide its training) so that the training objective is reached quickly.
TAMER Framework Paper
The main purpose of introducing human markers is to speed up training. Although reinforcement learning performs well in many domains, it still has shortcomings such as slow convergence and high training cost. In the real world especially, the cost of exploration or data collection for many tasks is very high, so speeding up training is one of the key problems reinforcement learning has to solve.
TAMER uses the knowledge of human markers, delivered as reward signals, to train the agent and accelerate its convergence. TAMER does not require the marker to have domain expertise or programming skill, which lowers the cost of collecting feedback. With TAMER+RL (reinforcement learning), the feedback of human markers can augment the reward from the Markov decision process (MDP) that drives ordinary reinforcement learning.
The application of TAMER architecture in reinforcement learning
Concretely, human markers play both the user and the AI assistant in a dialogue, supplying conversation samples and letting the model generate several responses. The markers then score and rank the candidate responses and feed the better ones back to the model. The agent learns from both feedback channels at once, human reinforcement and the MDP reward, as one integrated system, fine-tuning the model against the reward policy and iterating.
On this basis, ChatGPT understands and follows human language and instructions better than GPT-3, imitates humans more closely, and produces more coherent, logical text.
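The following is a rough sketch of the TAMER+RL idea: a value update that blends the ordinary MDP reward with the human marker's feedback signal. The weighting scheme and names (e.g. `human_feedback`, `beta`) are illustrative assumptions, not the exact formulation from the TAMER paper.

```python
from collections import defaultdict

# Tabular Q-learning-style update in which the two feedback channels are fused:
# the environment (MDP) reward plus a weighted human feedback signal.

Q = defaultdict(float)                  # state-action value table
alpha, gamma, beta = 0.1, 0.99, 0.5     # learning rate, discount, human-feedback weight

def update(state, action, env_reward, human_feedback, next_state, actions):
    combined = env_reward + beta * human_feedback             # fuse the two feedback modes
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = combined + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
```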
3.4 ChatGPT training
The training process of ChatGPT is divided into the following three stages:
Stage 1: training the supervised policy (SFT) model
GPT-3.5 itself has difficulty understanding the different intents behind different kinds of human instructions, and it also struggles to judge whether what it generates is of high quality. To give GPT-3.5 an initial grasp of instruction intent, questions are first randomly sampled from the dataset and human annotators write high-quality answers for them; these manually labelled data are then used to fine-tune GPT-3.5, producing the SFT (Supervised Fine-Tuning) model.
At this time, the SFT model has been better than GPT-3 in following instructions/conversations, but it does not necessarily conform to human preferences.
Training process of ChatGPT model
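As a rough sketch of Stage 1, supervised fine-tuning is essentially ordinary causal-language-model training on the human-written answers. Here a small GPT-2 checkpoint and a made-up demonstration pair stand in for GPT-3.5 and the labelled data; both are assumptions for illustration only.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Minimal SFT sketch: fine-tune a small causal LM on human-written answers.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

demonstrations = [("Explain photosynthesis.", "Photosynthesis is ...")]  # placeholder data

model.train()
for prompt, answer in demonstrations:
    batch = tokenizer(prompt + "\n" + answer, return_tensors="pt")
    outputs = model(**batch, labels=batch["input_ids"])   # standard causal-LM loss
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```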
Stage 2: training the reward model (RM)
At this stage, a reward model is trained on manually ranked data (about 33K examples). Questions are sampled at random from the dataset, and the model from Stage 1 generates several different answers for each one. Human annotators then rank these answers by overall quality, playing a role much like a coach or teacher.
Next, the ranking data are used to train the reward model. The ranked answers are combined pairwise to form many training pairs. The RM takes an input and returns a score judging the quality of the response; for each training pair, the parameters are adjusted so that the higher-quality answer scores above the lower-quality one.
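The sketch below shows one common form of the pairwise objective just described: for each (better, worse) pair from the human ranking, the reward model should give the better answer a higher scalar score. The exact loss used for ChatGPT is not spelled out here, so treat this as an illustrative choice.

```python
import torch
import torch.nn.functional as F

# Pairwise ranking loss: -log sigmoid(r_better - r_worse), minimised when the
# better-ranked answer scores higher than the worse one by a clear margin.

def pairwise_loss(score_better: torch.Tensor, score_worse: torch.Tensor) -> torch.Tensor:
    return -F.logsigmoid(score_better - score_worse).mean()

# Example: reward-model scores for a batch of three ranked pairs
loss = pairwise_loss(torch.tensor([2.1, 0.3, 1.0]), torch.tensor([1.2, 0.5, -0.4]))
print(loss.item())
```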
Stage 3: optimizing the policy with PPO (Proximal Policy Optimization) reinforcement learning
The core idea of PPO is to turn the on-policy training of policy-gradient methods into off-policy training, that is, to turn online learning into learning from previously collected samples; the conversion relies on importance sampling. In this stage, the reward model trained in Stage 2 scores the outputs and those scores are used to update the pre-trained model's parameters: questions are sampled from the dataset, the PPO model generates answers, the RM from the previous stage assigns quality scores, the reward scores are propagated back to form a policy gradient, and the PPO model's parameters are updated by reinforcement learning.
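The clipped PPO objective below is a compact sketch of this update. The importance-sampling ratio compares the current policy with the (frozen) policy that generated the answers, which is the on-policy-to-off-policy trick mentioned above, and clipping keeps each update step small; the advantage would come from the RM score minus a baseline.

```python
import torch

# Clipped PPO surrogate loss (sketch).
def ppo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
             advantage: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(logp_new - logp_old)                     # importance-sampling ratio
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantage, clipped * advantage).mean()
```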
If we continue to repeat the second and third stages, we will train a higher quality ChatGPT model through iteration.
4. Limitations of ChatGPT
As long as the user types in a question, ChatGPT gives an answer. Does that mean we no longer need to feed keywords to Google or Baidu to get the answers we want?
Although ChatGPT has shown excellent contextual dialogue and even programming ability, lifting the public impression of chatbots (ChatBots) from "artificially unintelligent" to genuinely interesting, we should also recognize that the technology still has limitations and is still improving.
1) Without large amounts of training corpus, ChatGPT lacks "human common sense" and the ability to extrapolate within a domain, and may even spout nonsense with a straight face. ChatGPT can "make up answers" in many fields, so users seeking a correct answer may be misled. For example, given a primary-school word problem, it can write out a long chain of calculation steps and still get the final answer wrong.
2) ChatGPT cannot handle complex, lengthy or highly specialized language. For questions from very professional fields such as finance, the natural sciences or medicine, ChatGPT may not generate an appropriate answer unless it has been "fed" sufficient corpus.
3) ChatGPT needs enormous computing power (chips) for training and deployment. Beyond the large corpora required to train the model, serving it currently requires powerful servers whose cost is beyond ordinary users. Even models with only billions of parameters need a staggering amount of compute to train and run, and if a genuinely free service had to handle hundreds of millions of search-style requests, no company could bear the cost. So for the general public, we still have to wait for lighter models or more cost-effective computing platforms.
4) ChatGPT cannot yet incorporate new knowledge online, and retraining the whole GPT model whenever new knowledge appears is unrealistic: both the training time and the training cost are hard for ordinary operators to accept. Adopting online training for new knowledge looks feasible and the corpus cost is relatively low, but introducing new data easily causes catastrophic forgetting of the original knowledge.
5) ChatGPT is still a black-box model. Its internal algorithmic logic has not yet been unpicked, so there is no guarantee that ChatGPT will not produce harmful outputs or even attack users.
Of course, these flaws do not overshadow its strengths. One engineer posted a dialogue in which ChatGPT wrote Verilog code (chip-design code); its level evidently already exceeds that of some Verilog beginners.
5. Future improvement direction of ChatGPT
5.1 RLAIF to reduce human feedback
Most of Anthropic's founding team are early or core former OpenAI employees who worked on GPT-3, multimodal neurons, and reinforcement learning from human preferences.
In December 2022, Anthropic published a paper titled "Constitutional AI: Harmlessness from AI Feedback", introducing its artificial-intelligence model Claude.
CAI model training process
Claude and ChatGPT both rely on reinforcement learning (RL) to train a preference model, and CAI (Constitutional AI) is likewise built on RLHF. The difference is that in CAI the ranking step uses a model (rather than humans) to supply the initial ranking of the generated outputs.
CAI replaces human preferences about harmlessness with AI feedback, hence RLAIF: an AI evaluates the responses against a set of written principles.
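A rough sketch of this AI-feedback ranking step follows. Instead of a human, a model is asked to judge which of two candidate answers better satisfies a written principle; `model_judge` is a hypothetical helper, not an actual Anthropic API, and the principle text is made up for illustration.

```python
# RLAIF-style preference labelling sketch: a feedback model ranks responses
# against a constitution-like principle, producing (better, worse) pairs that
# can then train a preference model exactly as in RLHF.

PRINCIPLE = "Choose the response that is the least harmful and most honest."

def model_judge(prompt: str, response_a: str, response_b: str) -> str:
    critique_prompt = (
        f"{PRINCIPLE}\n\nQuestion: {prompt}\n"
        f"Response A: {response_a}\nResponse B: {response_b}\n"
        "Which response better follows the principle? Answer A or B."
    )
    return "A"  # placeholder for a call to the feedback model

def ai_preference_pair(prompt, response_a, response_b):
    better = model_judge(prompt, response_a, response_b)
    return (response_a, response_b) if better == "A" else (response_b, response_a)
```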
5.2 Shoring up the weaknesses in math and science
Although ChatGPT has strong conversational ability, it is prone to talking confident nonsense when the conversation involves mathematical calculation.
Computer scientist Stephen Wolfram has proposed a solution to this problem: his Wolfram Language and the computational knowledge engine Wolfram|Alpha, whose back end is implemented in Mathematica.
ChatGPT and Wolfram | Alpha combine to deal with sorting problems
In this combined system, ChatGPT "talks" to Wolfram|Alpha just as humans use Wolfram|Alpha, and Wolfram|Alpha uses its symbolic translation ability to "translate" the natural-language expressions coming from ChatGPT into the corresponding symbolic computational language. In the past, academia was split between the "statistical approach" exemplified by ChatGPT and the "symbolic approach" of Wolfram|Alpha; their complementarity now points the NLP field toward a higher level.
ChatGPT does not need to generate such code itself. It only needs to produce ordinary natural language, which Wolfram|Alpha then translates into precise Wolfram Language and hands to the underlying Mathematica for computation.
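The glue code for such a division of labour might look like the sketch below: the language model restates the user's request as a plain computational query, and Wolfram|Alpha does the exact calculation. The endpoint shown is the public Wolfram|Alpha "short answers" API as best understood here, and `rewrite_as_query` is a hypothetical placeholder; treat both as assumptions.

```python
import requests

def rewrite_as_query(user_request: str) -> str:
    # Placeholder: in the combined system, the chat model would restate the
    # request as a plain computational query for Wolfram|Alpha.
    return "integrate x^2 sin(x) dx"

def wolfram_answer(query: str, appid: str) -> str:
    resp = requests.get(
        "https://api.wolframalpha.com/v1/result",   # assumed short-answers endpoint
        params={"appid": appid, "i": query},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.text

# answer = wolfram_answer(rewrite_as_query("What is the integral of x^2 sin x?"), appid="YOUR_APPID")
```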
5.3 Miniaturization of ChatGPT
Although ChatGPT is very powerful, its model size and cost of use also scare many people off.
There are three types of model compression that can reduce the size and cost of the model.
The first method is quantization, which lowers the numerical precision of individual weights. For example, dropping a Transformer from FP32 to INT8 has little impact on its accuracy (a toy sketch of both quantization and pruning appears below).
The second compression method is pruning, i.e. deleting parts of the network, from individual weights (unstructured pruning) up to coarser-grained components such as channels or whole weight matrices. This method works well for vision models and smaller-scale language models.
The third compression method is sparsification. For example, SparseGPT (arxiv.org/pdf/2301.0077), proposed by the Institute of Science and Technology Austria (ISTA), can prune GPT-family models to 50% sparsity in one shot without any retraining. For the GPT-175B model, this pruning takes only a few hours on a single GPU.
SparseGPT compression process
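Below is a toy, framework-free illustration of the first two compression ideas: symmetric FP32-to-INT8 quantization of a weight tensor, and unstructured magnitude pruning to a target sparsity. Real systems (and SparseGPT itself) use considerably more sophisticated, accuracy-aware procedures; this is only a sketch of the basic mechanics.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale                                   # 8-bit weights + one FP32 scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

def magnitude_prune(w: np.ndarray, sparsity: float = 0.5) -> np.ndarray:
    k = int(w.size * sparsity)
    threshold = np.sort(np.abs(w), axis=None)[k]      # k-th smallest magnitude
    pruned = w.copy()
    pruned[np.abs(pruned) < threshold] = 0.0          # zero out the smallest weights
    return pruned

w = np.random.randn(64, 64).astype(np.float32)
q, s = quantize_int8(w)
print("max quantization error:", np.abs(w - dequantize(q, s)).max())
print("fraction of zero weights:", (magnitude_prune(w) == 0).mean())
```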
6. ChatGPT's industrial future and investment opportunities
6.1 AIGC
Speaking of ChatGPT, we also have to mention AIGC.
AIGC (AI-generated content) means using AI technology to generate content. Compared with UGC (user-generated content) and PGC (professionally generated content) of the earlier Web 1.0 and Web 2.0 eras, AIGC represents a new round of change in how content is produced, and AIGC content will grow exponentially in the Web 3.0 era.
The emergence of ChatGPT model is of great significance to the application of AIGC in text/voice mode, and will have a significant impact on the upstream and downstream of AI industry.
6.2 Scenarios that stand to benefit
Downstream applications that stand to benefit include, but are not limited to, no-code programming, novel writing, conversational search engines, voice companions, voice work assistants, conversational virtual humans, AI customer service, machine translation and chip design. Upstream, demand will rise for computing-power chips, data annotation, natural language processing (NLP) and more.
Large models are exploding (more parameters/greater demand for computing power chips)
With continued progress in algorithms and computing power, ChatGPT will move toward more advanced, more capable versions, be applied in more and more fields, and generate more and better dialogue and content for humanity.