
Mar 30, 2022

Understanding service robot voice interaction and the three major technologies behind it

With advances in artificial intelligence, intelligent service robots are being used more and more widely across industries: reception robots, guide robots that give explanations, and conference robots can now be seen in many places. They have played an important role in easing labour shortages and reducing pressure on staff.

When we talk with a service robot, have we ever thought about which technologies it relies on to receive our voice and give a timely, accurate reply? For example, when asked "How is the weather today?", the service robot immediately answers: "The weather is sunny today, the temperature is 10℃-22℃, accompanied by southeasterly winds of force 4-5..."


In fact, voice interaction in service robots works much as it does in humans. Normal interaction requires three things: listening with the ears, understanding with the brain, and answering with the mouth. The "three major technologies" that let service robots interact intelligently are speech recognition (ASR), which serves as the "ear"; natural language processing (NLP), which serves as the "brain"; and speech synthesis (TTS), which serves as the "mouth".
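The ear, brain and mouth division can be sketched as three functions chained together. This is only an illustrative outline: the function names and the `Reply` type are hypothetical, and each stage is a trivial placeholder rather than a real ASR, NLP or TTS engine.

```python
from dataclasses import dataclass


@dataclass
class Reply:
    text: str      # what the robot decided to say
    audio: bytes   # stand-in for synthesized speech


def recognize_speech(audio: bytes) -> str:
    """'Ear' (ASR): placeholder that pretends the audio is already text."""
    return audio.decode("utf-8")


def understand(text: str) -> str:
    """'Brain' (NLP): placeholder that maps a question to a reply."""
    if "weather" in text.lower():
        return "The weather is sunny today."
    return "Sorry, I did not understand that."


def synthesize_speech(text: str) -> bytes:
    """'Mouth' (TTS): placeholder that pretends text is already audio."""
    return text.encode("utf-8")


def handle_utterance(audio: bytes) -> Reply:
    """Run the three stages in order: listen, understand, answer."""
    text = recognize_speech(audio)
    answer = understand(text)
    return Reply(text=answer, audio=synthesize_speech(answer))


reply = handle_utterance(b"How is the weather today?")
print(reply.text)
```

The point of the sketch is the data flow: audio goes in, text passes between the stages, and audio comes out.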



When we ask a question, the intelligent service robot picks up the sound through its microphone, and speech recognition (ASR) converts the sound into text (words and letters) that the robot can process. As shown in the figure above, in the noisy environment of Ningxia Museum, the Xiaoben intelligent service robot uses ASR to accurately "listen" to visitors' voices and convert them into text it can recognize, in preparation for the next step: semantic analysis and understanding.


The Xiaoben intelligent service robot's speech recognition (ASR) uses internationally advanced algorithms. It first encodes speech into a form the robot can process, namely a numeric vector representation, because the robot cannot recognize the raw sound signal directly. The sound is cut into short audio segments, and each segment is then represented by a numeric vector according to fixed rules.
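The encoding step can be illustrated with a minimal sketch. It assumes 16 kHz audio, 25 ms segments taken every 10 ms, and a deliberately simple two-number vector per segment (log energy and zero-crossing rate). These choices are assumptions for illustration; the article does not say which features Xiaoben uses, and production systems typically use richer spectral features.

```python
import math


def frame_signal(samples, frame_len, hop_len):
    """Cut a list of samples into overlapping frames of frame_len samples."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frames.append(samples[start:start + frame_len])
    return frames


def frame_features(frame):
    """Represent one frame as a small vector: [log energy, zero-crossing rate]."""
    energy = sum(s * s for s in frame) / len(frame)
    log_energy = math.log(energy + 1e-10)  # small constant avoids log(0) on silence
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    zcr = crossings / (len(frame) - 1)
    return [log_energy, zcr]


# One second of a 440 Hz tone at 16 kHz, a synthetic stand-in for recorded speech.
sample_rate = 16000
signal = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]

# 400 samples = 25 ms per frame, 160 samples = 10 ms hop (assumed values).
frames = frame_signal(signal, frame_len=400, hop_len=160)
vectors = [frame_features(f) for f in frames]
print(len(frames), "frames, each described by", len(vectors[0]), "numbers")
```

Each entry of `vectors` is the numeric description of one short segment, which is the kind of input the later decoding stage works on.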


Next comes decoding, the process of assembling the numeric vectors into text. The encoded vectors are fed into the acoustic model to obtain the words and letters corresponding to each short segment, and these are then passed through the language model to form words and sentences that Xiaoben can recognize.
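How an acoustic model and a language model can cooperate during decoding is shown in the toy sketch below. It assumes the acoustic model has already produced a few candidate words with probabilities for each segment, and it uses a hand-written bigram table as the language model. All scores are invented for illustration, and the search (a small Viterbi pass) is one common approach, not necessarily the one Xiaoben uses.

```python
import math


def decode(acoustic_scores, bigram_probs, unseen_prob=1e-4):
    """Viterbi search for the word sequence with the best combined score.

    acoustic_scores: list (one per segment) of {word: P(audio | word)}
    bigram_probs:    {(previous_word, word): P(word | previous_word)}
    """
    # best[word] = (log score of the best path ending in word, that path)
    best = {"<s>": (0.0, [])}
    for candidates in acoustic_scores:
        new_best = {}
        for word, acoustic_p in candidates.items():
            options = []
            for prev, (score, path) in best.items():
                lm_p = bigram_probs.get((prev, word), unseen_prob)
                total = score + math.log(acoustic_p) + math.log(lm_p)
                options.append((total, path + [word]))
            new_best[word] = max(options, key=lambda o: o[0])
        best = new_best
    return max(best.values(), key=lambda o: o[0])[1]


# Toy scores: on sound alone, the last segment slightly favours "whether".
acoustic = [
    {"how": 0.9, "who": 0.1},
    {"is": 0.8, "his": 0.2},
    {"the": 0.9, "a": 0.1},
    {"whether": 0.55, "weather": 0.45},
]
bigrams = {
    ("<s>", "how"): 0.5, ("<s>", "who"): 0.5,
    ("how", "is"): 0.6, ("who", "is"): 0.6,
    ("is", "the"): 0.5, ("his", "the"): 0.01,
    ("the", "weather"): 0.3, ("the", "whether"): 0.001,
}
result = decode(acoustic, bigrams)
print(" ".join(result))
```

In this example the language model overrules the acoustic model on the last word, because "the weather" is far more likely than "the whether" in the bigram table.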


Both the acoustic model and the language model are neural networks, trained on a large amount of speech and language data. This is one of the reasons the Xiaoben intelligent service robot can accurately recognize a wide variety of complex speech.


Once decoding is complete, the recognized text is passed to the service robot's natural language processing (NLP) module, which works out the customer's intent, emotional tendency and other information. This is one of the core modules of voice interaction, and one of the most difficult.



Natural language processing (NLP) gauges people's opinions and tendencies through techniques such as syntactic analysis, semantic understanding, text similarity processing and sentiment analysis, and can accurately distinguish which expressions belong to a given class of intent and which do not. The NLP technology independently developed by Xiaoben Intelligent analyzes and understands the information it receives. The picture above shows a Xiaoben intelligent service robot in the office of Jinan Energy Group. Visitors who come to handle business only need to state their needs; the robot accurately understands their intent, retrieves the corresponding answer from the "5G cloud brain", and issues an accurate reply instruction.
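One simple way to decide which class of intent an expression belongs to is text similarity against example sentences. The sketch below uses bag-of-words cosine similarity with a threshold. The intent names, example sentences and threshold are hypothetical, and a production NLP module would be considerably more sophisticated.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical intents, each described by a few example sentences.
INTENT_EXAMPLES = {
    "ask_weather": ["how is the weather today", "will it rain tomorrow"],
    "ask_tickets": ["book an air ticket", "what flights are available"],
}


def classify_intent(utterance, threshold=0.5):
    """Return the intent of the most similar example, or 'unknown' below threshold."""
    query = bag_of_words(utterance)
    best_intent, best_score = "unknown", 0.0
    for intent, examples in INTENT_EXAMPLES.items():
        for example in examples:
            score = cosine_similarity(query, bag_of_words(example))
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent if best_score >= threshold else "unknown"


print(classify_intent("How is the weather?"))
print(classify_intent("Please book an air ticket"))
print(classify_intent("Where is the restroom?"))
```

The threshold is what lets the classifier say that an expression belongs to none of the known intents, rather than forcing every sentence into the nearest class.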


The "5G cloud brain" of the Xiaoben intelligent service robot stores a massive knowledge base. It supports queries about general knowledge, weather, air tickets and so on, and synchronizes enterprise information in various forms so that it can be presented as voice, video and animation. According to Xiaoben Intelligent, it can handle more than 98% of visitors' daily chat and corporate business Q&A.


When the service robot has a reply ready, it needs its "mouth" to say it. This is the job of speech synthesis (TTS), which converts the reply instruction into speech that humans can understand. As shown in the figure below, after the Xiaoben intelligent service robot at Jinan Coach Terminal "understands" a traveller's question, it retrieves an accurate reply from the "5G cloud brain" and converts it into voice, video and pictures that travellers can understand, so they can easily get the travel information they need.



The speech synthesis (TTS) workflow can be divided into two steps. The first step is text processing: converting the text of the reply into a phoneme sequence and marking the start time, end time and frequency changes of each phoneme. This step should not be underestimated. It has to distinguish words that are spelled the same but pronounced differently, handle abbreviations, and decide where pauses go.
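The text processing step can be sketched as follows. The tiny pronunciation lexicon, the fixed phoneme and pause durations, and the steadily falling pitch are all assumptions made for illustration; handling of same-spelling words and abbreviations is left out to keep the example short.

```python
import re

# Toy pronunciation lexicon (ARPAbet-style symbols, stress marks omitted).
LEXICON = {
    "today": ["T", "AH", "D", "EY"],
    "the": ["DH", "AH"],
    "weather": ["W", "EH", "DH", "ER"],
    "is": ["IH", "Z"],
    "sunny": ["S", "AH", "N", "IY"],
}

PHONEME_MS = 80        # assumed fixed duration per phoneme
PAUSE_MS = 200         # assumed pause at a comma or full stop
BASE_PITCH_HZ = 200.0  # assumed starting pitch
PITCH_STEP_HZ = 2.0    # assumed pitch drop per phoneme (falling intonation)


def text_to_phonemes(text):
    """Return a list of (phoneme, start_ms, end_ms, pitch_hz) tuples."""
    tokens = re.findall(r"[a-z']+|[,.]", text.lower())
    timeline, clock, pitch = [], 0, BASE_PITCH_HZ
    for token in tokens:
        if token in (",", "."):
            # Punctuation becomes a silent pause with no pitch.
            timeline.append(("<pause>", clock, clock + PAUSE_MS, 0.0))
            clock += PAUSE_MS
            continue
        for phoneme in LEXICON[token]:  # KeyError for words outside the toy lexicon
            timeline.append((phoneme, clock, clock + PHONEME_MS, pitch))
            clock += PHONEME_MS
            pitch -= PITCH_STEP_HZ
    return timeline


timeline = text_to_phonemes("Today, the weather is sunny.")
for entry in timeline[:6]:
    print(entry)
```

Each tuple carries exactly the annotations the text describes: which phoneme to produce, when it starts and ends, and at what frequency.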


The second step is speech synthesis itself: generating the speech waveform from the marked start times, end times and frequency changes of the phonemes, and finally playing it through the speaker.
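To make the idea of generating sound from timing and frequency annotations concrete, here is a deliberately crude sketch. It assumes phonemes annotated as `(phoneme, start_ms, end_ms, pitch_hz)` tuples, a hypothetical format, and renders each one as a plain sine tone at the given pitch, with silence for pauses. Real TTS systems produce actual speech sounds with far more elaborate models; this only shows how durations and frequencies turn into samples.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 16000  # samples per second (assumed)


def render_tones(timeline):
    """Render (phoneme, start_ms, end_ms, pitch_hz) entries as 16-bit PCM samples."""
    samples = []
    phase = 0.0  # carried across phonemes so the tone has no clicks at boundaries
    for _phoneme, start_ms, end_ms, pitch_hz in timeline:
        count = (end_ms - start_ms) * SAMPLE_RATE // 1000
        for _ in range(count):
            if pitch_hz > 0:
                phase += 2 * math.pi * pitch_hz / SAMPLE_RATE
                samples.append(int(0.3 * 32767 * math.sin(phase)))
            else:
                samples.append(0)  # a pitch of 0 marks a silent pause
    return samples


def to_wav_bytes(samples):
    """Pack mono 16-bit samples into an in-memory WAV file."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(struct.pack("<%dh" % len(samples), *samples))
    return buffer.getvalue()


# Hand-written annotations standing in for the output of the text processing step.
example_timeline = [
    ("S", 0, 80, 200.0),
    ("AH", 80, 160, 198.0),
    ("<pause>", 160, 360, 0.0),
]
samples = render_tones(example_timeline)
wav_data = to_wav_bytes(samples)
print(len(samples), "samples,", len(wav_data), "bytes of WAV data")
```

Writing `wav_data` to a file with a `.wav` extension gives something a media player can play, which corresponds to the final "through the speaker" stage.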


Xiaoben Intelligent's speech synthesis (TTS) converts text in real time, with conversion times measured in seconds. The rhythm of the output speech is smooth, so the information sounds natural to the listener, with almost none of the cold, stilted feel of machine speech.


Xiaoben Intelligent's powerful natural language processing capabilities can meet the need for efficient, accurate service in different application scenarios, with different sample data synchronized for each scenario. Xiaoben Intelligent has served 6000+ customers across government service halls, courts, shopping malls, airports and other scenarios, meeting the differentiated needs of different enterprises.


From speech recognition to intelligent question answering, from intent recognition to sentiment analysis, all of this reflects the Xiaoben intelligent service robot's persistent pursuit of in-depth service in real-world scenarios. In the future, Xiaoben will continue to provide valuable insights for enterprises and society, so as to rejuvenate traditional industries and make our lives more convenient and efficient.

