CHAPTER 24 Dialog Systems and Chatbots
Reading notes on Speech and Language Processing (3rd ed.)
Language is the mark of humanity and sentience, and conversation or dialog is the most fundamental and specially privileged arena of language.
This chapter introduces the fundamental algorithms of conversational agents, or dialog systems.
Task-oriented dialog agents are designed for a particular task and set up to have short conversations (from as little as a single interaction to perhaps half-a-dozen interactions) to get information from the user to help complete the task.
Chatbots are systems designed for extended conversations, set up to mimic the unstructured conversations or ‘chats’ characteristic of human-human interaction, rather than focused on a particular task like booking plane flights.
24.1 Chatbots
Like practically everything else in language processing, chatbot architectures fall into two classes: rule-based systems and corpus-based systems. Rule-based systems include the early influential ELIZA and PARRY systems. Corpus-based systems mine large datasets of human-human conversations, which can be done by using information retrieval (IR-based systems simply copy a human’s response from a previous conversation) or by using a machine translation paradigm such as neural network sequence-to-sequence systems, to learn to map from a user utterance to a system response.
24.1.1 Rule-based chatbots: ELIZA and PARRY
[Figure 24.5 from the textbook: not reproduced here]
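ELIZA's full machinery (ranked keywords, a memory queue, chains of transforms) is more elaborate, but the core loop, matching the user's sentence against a pattern and transforming it into a response, fits in a few lines. Here is a minimal sketch; the rules and templates are illustrative stand-ins, not ELIZA's actual rule set.

```python
import random
import re

# Illustrative ELIZA-style rules: (pattern, response templates).
# These are toy stand-ins, not ELIZA's actual rule set.
RULES = [
    (r".*\bI need (.*)", ["Why do you need {0}?",
                          "Would it really help you to get {0}?"]),
    (r".*\bI am (.*)", ["How long have you been {0}?",
                        "Why do you think you are {0}?"]),
    (r".*\bmy (mother|father)\b", ["Tell me more about your {0}."]),
]

# Swap first-person words so "I need my notes" -> "you need your notes".
PRONOUN_SWAP = {"i": "you", "my": "your", "me": "you", "am": "are"}

def swap_pronouns(text):
    return " ".join(PRONOUN_SWAP.get(w.lower(), w) for w in text.split())

def eliza_respond(utterance):
    for pattern, templates in RULES:
        m = re.match(pattern, utterance, re.IGNORECASE)
        if m:
            groups = [swap_pronouns(g) for g in m.groups()]
            return random.choice(templates).format(*groups)
    return "Please go on."  # default when no rule matches

print(eliza_respond("I need a vacation"))
# -> "Why do you need a vacation?" or "Would it really help you to get a vacation?"
```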
24.1.2 Corpus-based chatbots
Corpus-based chatbots, instead of using hand-built rules, mine large corpora of human-human conversations, or sometimes mine just the human responses from human-machine conversations. Chatbot responses can even be extracted from sentences in corpora of non-dialog text.
There are two common architectures for corpus-based chatbots: information retrieval, and machine learned sequence transduction. Like rule-based chatbots (but unlike frame-based dialog systems), most corpus-based chatbots do very little modeling of the conversational context. Instead they focus on generating a single response turn that is appropriate given the user’s immediately previous utterance. For this reason they are often called response generation systems. Corpus-based chatbots thus have some similarity to question answering systems, which focus on single responses while ignoring context or larger conversational goals.
IR-based chatbots
The principle behind information retrieval based chatbots is to respond to a user’s turn X by repeating some appropriate turn Y from a corpus of natural (human) text. The differences across such systems lie in how they choose the corpus, and how they decide what counts as an appropriate human turn to copy.
A common choice of corpus is a database of human conversations. These can come from microblogging platforms like Twitter or Weibo (微博). Another approach is to use corpora of movie dialog. Once a chatbot has been deployed, the turns that humans use to respond to it can be used as additional conversational data for training.
Given the corpus and the user’s sentence, IR-based systems can use any retrieval algorithm to choose an appropriate response from the corpus. The two simplest methods are the following:
- **Return the response to the most similar turn**: Given a user query $q$ and a conversational corpus $C$, find the turn $t$ in $C$ that is most similar to $q$ (for example, has the highest cosine with $q$) and return the following turn, i.e., the human response to $t$ in $C$:

  $$r=\operatorname{response}\left(\mathop{\operatorname{argmax}}\limits_{t \in C} \frac{q^{T} t}{\|q\|\,\|t\|}\right)$$

  The idea is that we should look for a turn that most resembles the user's turn, and return the human response to that turn (Jafarpour et al. 2009, Leuski and Traum 2011).

- **Return the most similar turn**: Given a user query $q$ and a conversational corpus $C$, return the turn $t$ in $C$ that is most similar to $q$ (for example, has the highest cosine with $q$):

  $$r=\mathop{\operatorname{argmax}}\limits_{t \in C} \frac{q^{T} t}{\|q\|\,\|t\|}$$

  The idea here is to directly match the user's query $q$ with turns from $C$, since a good response will often share words or semantics with the prior turn.
In each case, any similarity function can be used, most commonly cosines computed either over words (using tf-idf) or over embeddings.
Although returning the response to the most similar turn seems like a more intuitive algorithm, returning the most similar turn seems to work better in practice, perhaps because selecting the response adds another layer of indirection that can allow for more noise (Ritter et al. 2011, Wang et al. 2013).
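Here is a minimal sketch of both retrieval methods using tf-idf cosine similarity with scikit-learn. The three-turn corpus is a toy stand-in; a real system would index millions of turns and use a full IR ranking pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus: each turn paired with the human response that followed it.
turns = ["do you like movies", "what are you doing tonight", "I love pizza"]
responses = ["yes, mostly sci-fi", "just watching TV", "me too, with mushrooms"]

vectorizer = TfidfVectorizer()
turn_vecs = vectorizer.fit_transform(turns)

def respond(query, use_response_to_turn=False):
    q_vec = vectorizer.transform([query])
    best = cosine_similarity(q_vec, turn_vecs)[0].argmax()
    # Method 1 (use_response_to_turn=True): return the human response to
    # the most similar turn. Method 2 (default, often better in practice):
    # return the most similar turn itself.
    return responses[best] if use_response_to_turn else turns[best]

print(respond("do you like pizza"))        # -> "do you like movies"
print(respond("do you like pizza", True))  # -> "yes, mostly sci-fi"
```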
The IR-based approach can be extended by using more features than just the words in the query q (such as words in prior turns, or information about the user), and by using any full IR ranking approach. Commercial implementations of the IR-based approach include Cleverbot (Carpenter, 2017) and Microsoft’s XiaoIce (小冰, “Little Bing”) system (Microsoft, 2014).
Instead of just using corpora of conversation, the IR-based approach can be used to draw responses from narrative (non-dialog) text. For example, the pioneering COBOT chatbot (Isbell et al., 2000) generated responses by selecting sentences from a corpus that combined the Unabomber Manifesto by Theodore Kaczynski, articles on alien abduction, the scripts of “The Big Lebowski” and “Planet of the Apes”. Chatbots that want to generate informative turns such as answers to user questions can use texts like Wikipedia to draw on sentences that might contain those answers (Yan et al., 2016).
Sequence to sequence chatbots
An alternate way to use a corpus to generate dialog is to think of response generation as a task of transducing from the user’s prior turn to the system’s turn. This is basically the machine learning version of ELIZA: the system learns from a corpus to transduce a question to an answer.
This idea was first developed by using phrase-based machine translation (Ritter et al., 2011) to translate a user turn to a system response.
Since a coherent response is not semantically a translation of the user's turn, however, later work (roughly contemporaneously Shang et al. 2015, Vinyals and Le 2015, and Sordoni et al. 2015) instead modeled response generation with encoder-decoder (seq2seq) models (Chapter 22), as shown in Fig. 24.6.
[Figure 24.6 from the textbook: an encoder-decoder (seq2seq) model mapping a user turn to a system response; not reproduced here]
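A minimal sketch of such an encoder-decoder responder in PyTorch, with toy vocabulary and dimension sizes; real systems add attention, beam-search decoding, and proper BOS/EOS handling, all omitted here.

```python
import torch
import torch.nn as nn

VOCAB, EMB, HID = 1000, 64, 128  # toy sizes

class Seq2SeqResponder(nn.Module):
    """Encoder reads the user turn; decoder generates the response."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.encoder = nn.GRU(EMB, HID, batch_first=True)
        self.decoder = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)  # hidden state -> vocabulary logits

    def forward(self, src_ids, tgt_ids):
        # Encode the user turn and keep only the final hidden state.
        _, h = self.encoder(self.embed(src_ids))
        # Decode with teacher forcing: the gold response tokens are fed in
        # as decoder inputs (BOS/EOS shifting omitted for brevity).
        dec_out, _ = self.decoder(self.embed(tgt_ids), h)
        return self.out(dec_out)  # (batch, tgt_len, VOCAB)

model = Seq2SeqResponder()
src = torch.randint(0, VOCAB, (2, 7))  # two user turns of length 7
tgt = torch.randint(0, VOCAB, (2, 5))  # two gold responses of length 5
logits = model(src, tgt)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, VOCAB), tgt.reshape(-1))
```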
A number of modifications are required to the basic seq2seq model to adapt it for the task of response generation. For example, basic seq2seq models have a tendency to produce predictable but repetitive, and therefore dull, responses like “I’m OK” or “I don’t know” that shut down the conversation. This can be addressed by changing the objective function for seq2seq model training to a mutual information objective, or by modifying a beam decoder to keep more diverse responses in the beam (Li et al., 2016a).
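One concrete way to apply the mutual-information idea is at decoding time: rerank the beam's n-best responses by log p(T|S) - λ log p(T), so that generically likely responses pay a penalty. The sketch below uses hypothetical hand-picked log-probabilities in place of real model scores.

```python
def mmi_rerank(candidates, logp_t_given_s, logp_t, lam=0.5):
    """Pick the candidate maximizing log p(T|S) - lam * log p(T):
    dull, generically likely responses pay a large log p(T) penalty."""
    scored = [(logp_t_given_s(c) - lam * logp_t(c), c) for c in candidates]
    return max(scored)[1]

# Hypothetical scores standing in for a seq2seq model and a language model.
cands = ["i don't know", "the meeting is at 3pm"]
lp_given_src = {"i don't know": -5.0, "the meeting is at 3pm": -6.0}
lp_prior = {"i don't know": -2.0, "the meeting is at 3pm": -9.0}
print(mmi_rerank(cands, lp_given_src.get, lp_prior.get))
# -> "the meeting is at 3pm": the informative response wins after the penalty
```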
Another problem with the simple SEQ2SEQ response generation architecture is its inability to model the longer prior context of the conversation. This can be addressed by allowing the model to see prior turns, such as by using a hierarchical model that summarizes information over multiple prior turns (Lowe et al., 2017b).
Finally, SEQ2SEQ response generators focus on generating single responses, and so don’t tend to do a good job of continuously generating responses that cohere across multiple turns. This can be addressed by using reinforcement learning, as well as techniques like adversarial networks, to learn to choose responses that make the overall conversation more natural (Li et al. 2016b, Li et al. 2017).
Fig. 24.7 shows some sample responses generated by a vanilla SEQ2SEQ model, and by a model trained with an adversarial algorithm to produce responses that are harder to distinguish from human responses (Li et al., 2017).
[Figure 24.7 from the textbook: sample responses from a vanilla SEQ2SEQ model versus an adversarially trained model; not reproduced here]
Evaluating Chatbots
Chatbots are generally evaluated by humans. The slot-filling evaluations used for task-based dialog (Section 24.2.3) aren’t appropriate for this task (Artstein et al., 2009), and word-overlap metrics like BLEU for comparing a chatbot’s response to a human response turn out to correlate very poorly with human judgments (Liu et al., 2016). BLEU performs poorly because there are so many possible responses to any given turn; word-overlap metrics work best when the space of responses is small and lexically overlapping, as is the case in machine translation.
While human evaluation is therefore required for evaluating chatbots, there are beginning to be models for automatic evaluation. The ADEM (Lowe et al., 2017a) classifier is trained on a set of responses labeled by humans with how appropriate they are, and learns to predict this label from the dialog context and the words in the system response.
Another paradigm is adversarial evaluation (Bowman et al. 2016, Kannan and Vinyals 2016, Li et al. 2017), inspired by the Turing test. The idea is to train a “Turing-like” evaluator classifier to distinguish between human-generated responses and machine-generated responses. The more successful a response generation system is at fooling this evaluator, the better the system.
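A toy sketch of the evaluator side: train a classifier on responses labeled human vs. machine, then check whether a generator's output gets classified as human. The four training examples and the tf-idf plus logistic regression setup are stand-ins for a much larger labeled set and a stronger classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy labeled responses (a real evaluator needs a large labeled set).
human = ["sure, see you at noon", "ha, that movie was terrible"]
machine = ["i don't know", "i don't know what you mean"]
texts = human + machine
labels = [1, 1, 0, 0]  # 1 = human-generated, 0 = machine-generated

vec = TfidfVectorizer().fit(texts)
evaluator = LogisticRegression().fit(vec.transform(texts), labels)

# A response generation system "wins" when the evaluator is fooled into
# labeling its output as human.
candidate = "see you at the usual place"
fooled = evaluator.predict(vec.transform([candidate]))[0] == 1
print("evaluator fooled:", fooled)
```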
24.2 Frame Based Dialog Agents
Modern task-based dialog systems are based on a domain ontology, a knowledge structure representing the kinds of intentions the system can extract from user sentences. The ontology defines one or more frames, each a collection of slots, and defines the values that each slot can take. This frame-based architecture was first introduced in 1977 in the influential GUS system for travel planning (Bobrow et al., 1977), and has been astonishingly long-lived, underlying most modern commercial digital assistants.
Types in GUS, as in modern frame-based dialog agents, may have hierarchical structure; for example the date type in GUS is itself a frame whose slots have types like bounded integer or membership in a set of weekday names:
```
DATE
  MONTH   NAME
  DAY     (BOUNDED-INTEGER 1 31)
  YEAR    INTEGER
  WEEKDAY (MEMBER (SUNDAY MONDAY TUESDAY WEDNESDAY THURSDAY FRIDAY SATURDAY))
```
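One way to render this DATE frame as a typed data structure, a sketch in Python where unfilled slots are None and the type constraints from the fragment above are checked at construction:

```python
from dataclasses import dataclass
from typing import Optional

WEEKDAYS = {"SUNDAY", "MONDAY", "TUESDAY", "WEDNESDAY",
            "THURSDAY", "FRIDAY", "SATURDAY"}

@dataclass
class Date:
    """The GUS DATE frame: slots with typed fillers; None = unfilled."""
    month_name: Optional[str] = None
    day: Optional[int] = None        # BOUNDED-INTEGER 1..31
    year: Optional[int] = None
    weekday: Optional[str] = None    # MEMBER of WEEKDAYS

    def __post_init__(self):
        if self.day is not None and not 1 <= self.day <= 31:
            raise ValueError("DAY must be a bounded integer in 1..31")
        if self.weekday is not None and self.weekday not in WEEKDAYS:
            raise ValueError("WEEKDAY must be a member of the weekday set")

d = Date(month_name="MAY", day=17)  # partially filled, as during a dialog
```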
24.2.1 Control structure for frame-based dialog
The control architecture of frame-based dialog systems is designed around the frame. The goal is to fill the slots in the frame with the fillers the user intends, and then perform the relevant action for the user (answering a question, or booking a flight). Most frame-based dialog systems are based on finite-state automata that are hand-designed for the task by a dialog designer.
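A sketch of the simplest such control structure: a system-initiative loop that asks about each unfilled slot in a fixed order and acts once the frame is complete. The slot list is a hypothetical air-travel example, and the NLU step is stubbed out to just take the raw answer.

```python
# Hypothetical air-travel frame: the dialog designer fixes the slot order.
SLOT_QUESTIONS = [
    ("origin", "What city are you leaving from?"),
    ("destination", "Where are you going?"),
    ("date", "What day would you like to travel?"),
]

def run_dialog(ask=input):
    """System-initiative finite-state control: one state per slot,
    advancing only when the current slot gets filled."""
    frame = {slot: None for slot, _ in SLOT_QUESTIONS}
    for slot, question in SLOT_QUESTIONS:
        while frame[slot] is None:
            answer = ask(question + " ").strip()
            # A real system would run NLU here; we take the raw answer.
            frame[slot] = answer or None
    return frame  # complete frame, ready for the back-end action

# run_dialog() asks the three questions in order and returns the filled frame.
```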
24.2.2 Natural language understanding for filling slots
The goal of the natural language understanding component is to extract three things from the user’s utterance. The first task is domain classification: which domain is the user talking about (flights, a calendar, an alarm clock)? The second is user intent determination: what task or goal is the user trying to accomplish (for example, show a flight or remove a calendar appointment)? Finally, we need to do slot filling: extract the particular slots and fillers that the user intends the system to understand from their utterance with respect to their intent.
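A hedged, rule-based sketch of all three tasks on a single utterance; real systems use trained classifiers for domain and intent and sequence labelers for slots, and the regexes and slot names here are illustrative only (the destination pattern handles only single-word city names).

```python
import re

def understand(utterance):
    """Toy NLU: domain classification, intent determination, slot filling."""
    u = utterance.lower()
    domain = "air-travel" if re.search(r"\b(fly|flight|plane)\b", u) else "unknown"
    intent = "book-flight" if re.search(r"\b(book|want|need)\b", u) else "unknown"
    slots = {}
    # Destination: a single word after "to" that ends the utterance or
    # precedes " on" (toy pattern, misses multi-word cities).
    m = re.search(r"\bto ([a-z]+)(?= on\b|$)", u)
    if m:
        slots["DESTINATION"] = m.group(1)
    m = re.search(r"\bon (monday|tuesday|wednesday|thursday"
                  r"|friday|saturday|sunday)\b", u)
    if m:
        slots["DEPART-DAY"] = m.group(1)
    return domain, intent, slots

print(understand("I want to fly to Boston on Friday"))
# -> ('air-travel', 'book-flight',
#     {'DESTINATION': 'boston', 'DEPART-DAY': 'friday'})
```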
24.2.3 Evaluating Slot Filling
An intrinsic error metric for natural language understanding systems for slot filling is the Slot Error Rate for each sentence:
$$\text{Slot Error Rate for a Sentence}=\frac{\#\text{ of inserted/deleted/substituted slots}}{\#\text{ of total reference slots for sentence}}$$
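The metric is straightforward to compute when hypothesis and reference slots are given as name-to-filler mappings; a small sketch (the example frames are made up):

```python
def slot_error_rate(hyp_slots, ref_slots):
    """(insertions + deletions + substitutions) / # reference slots.
    Slots are dicts mapping slot names to filler values."""
    substitutions = sum(1 for s in ref_slots
                        if s in hyp_slots and hyp_slots[s] != ref_slots[s])
    deletions = sum(1 for s in ref_slots if s not in hyp_slots)
    insertions = sum(1 for s in hyp_slots if s not in ref_slots)
    return (insertions + deletions + substitutions) / len(ref_slots)

ref = {"DESTINATION": "boston", "DEPART-DAY": "friday", "ORIGIN": "denver"}
hyp = {"DESTINATION": "austin", "DEPART-DAY": "friday"}
print(slot_error_rate(hyp, ref))  # (0 ins + 1 del + 1 sub) / 3 ≈ 0.67
```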
A perhaps more important, although less fine-grained, measure of success is an extrinsic metric like task error rate. For a calendar-scheduling task, for example, task error rate would quantify how often the correct meeting was added to the calendar at the end of the interaction.
24.2.4 Other components of frame-based dialog
24.3 VoiceXML
VoiceXML, the Voice Extensible Markup Language (http://www.voicexml.org/), is an XML-based dialog design language for creating simple frame-based dialogs.
24.4 Evaluating Dialog Systems
Dialog systems are evaluated with measures like the following, typically gathered by having users interact with the system and then complete a survey:

- User satisfaction rating: the user's overall rating of the interaction.
- Task completion success: measured by evaluating the correctness of the total solution, or by the user’s perception of whether they completed the task.
- Efficiency cost: measures of the system’s efficiency at helping users, such as the total elapsed time or the total number of turns.
- Quality cost: measures of other aspects of the interaction that affect users’ perception of the system.
24.5 Dialog System Design
Dialog system design is often called voice user interface design, and generally follows the user-centered design principles of Gould and Lewis (1985):
- Study the user and task: Understand the potential users and the nature of the task by interviewing users, investigating similar systems, and studying related human-human dialogs.
- Build simulations and prototypes: A crucial tool in building dialog systems is the Wizard-of-Oz system, in which users interact with what they believe is a program but is in fact a human "wizard" simulating the system's behavior.
- Iteratively test the design on users: An iterative design cycle with embedded user testing is essential in system design (Nielsen 1992, Cole et al. 1997, Yankelovich et al. 1995, Landauer 1995).
24.5.1 Ethical Issues in Dialog System Design
24.6 Summary
Conversational agents are a crucial speech and language processing application, one that is already widely used commercially.
- Chatbots are conversational agents designed to mimic the appearance of informal human conversation. Rule-based chatbots like ELIZA and its modern descendants use rules to map user sentences into system responses. Corpus-based chatbots mine logs of human conversation to learn to automatically map user sentences into system responses.
- For task-based dialog, most commercial dialog systems use the GUS or frame-based architecture, in which the designer specifies a domain ontology, a set of frames of information that the system is designed to acquire from the user, each consisting of slots with typed fillers.
- A number of commercial systems allow developers to implement simple frame-based dialog systems, such as the user-definable skills in Amazon Alexa or the actions in Google Assistant. VoiceXML is a simple declarative language that has similar capabilities to each of them for specifying deterministic frame-based dialog systems.
- Dialog systems are a kind of human-computer interaction, and general HCI principles apply in their design, including the role of the user, simulations such as Wizard-of-Oz systems, and the importance of iterative design and testing on real users.