CHAPTER 25 Advanced Dialog Systems

Reading notes on Speech and Language Processing, 3rd edition.

In this chapter we describe the dialog-state architecture, also called the belief-state or information-state architecture. Like GUS systems, these agents fill slots, but they are also capable of understanding and generating dialog acts: actions like asking a question, making a proposal, rejecting a suggestion, or acknowledging an utterance. They can incorporate this knowledge into a richer model of the state of the dialog at any point.

Figure 25.1 shows a typical architecture for a dialog-state system. It has six components. As with the GUS-style frame-based systems, the speech recognition and understanding components extract meaning from the input, and the generation and TTS components map from meaning to speech. The parts that are different from the simple GUS system are the dialog state tracker, which maintains the current state of the dialog (including the user's most recent dialog act, plus the entire set of slot-filler constraints the user has expressed so far), and the dialog policy, which decides what the system should do or say next.

[Figure 25.1: Architecture of a dialog-state system (image 25.1.png failed to load)]
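These six components can be sketched end to end. The following is a minimal, hypothetical skeleton (the component names and signatures are my own, with ASR and TTS stubbed out), meant only to show how the state tracker and policy sit between understanding and generation:

```python
# Minimal skeleton of a dialog-state architecture (hypothetical names/signatures).
# The point is the flow: audio -> text -> (dialog act, slots) -> state -> next act -> text.
from dataclasses import dataclass, field

@dataclass
class DialogState:
    frame: dict = field(default_factory=dict)  # slot -> filler constraints so far
    last_user_act: str = None                  # user's most recent dialog act

def asr(audio):                 # speech recognition (stub)
    return "i'd like a flight to boston"

def nlu(text):                  # extract dialog act + slot fillers (stub)
    return "INFORM", {"destination": "boston"}

def track_state(state, act, slots):  # dialog state tracker: update the frame
    state.frame.update(slots)
    state.last_user_act = act
    return state

def policy(state):              # dialog policy: pick the system's next act
    if "date" not in state.frame:
        return ("REQUEST", "date")
    return ("CONFIRM", dict(state.frame))

def nlg(system_act):            # generation (stub)
    act, arg = system_act
    return "What date would you like to travel?" if act == "REQUEST" else f"Confirming: {arg}"

state = DialogState()
act, slots = nlu(asr(b"..."))
state = track_state(state, act, slots)
print(nlg(policy(state)))       # -> "What date would you like to travel?"
```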

25.1 Dialog Acts

Each utterance in a dialog is a kind of action being performed by the speaker. These actions are commonly called speech acts; here’s one taxonomy consisting of 4 major classes (Bach and Harnish, 1979):

Constatives: committing the speaker to something’s being the case (answering, claiming, confirming, denying, disagreeing, stating)

Directives: attempts by the speaker to get the addressee to do something (advising, asking, forbidding, inviting, ordering, requesting)

Commissives: committing the speaker to some future course of action (promising, planning, vowing, betting, opposing)

Acknowledgments: express the speaker’s attitude regarding the hearer with respect to some social action (apologizing, greeting, thanking, accepting an acknowledgment)

While this idea of speech acts is powerful, modern systems expand these early taxonomies of speech acts to better describe actual conversations. This is because a dialog is not a series of unrelated independent speech acts, but rather a collective act performed by the speaker and the hearer. In performing this joint action the speaker and hearer must constantly establish common ground (Stalnaker, 1978), the set of things that are mutually believed by both speakers.

The need to achieve common ground means that the hearer must ground the speaker’s utterances. To ground means to acknowledge, to make it clear that the hearer has understood the speaker’s meaning and intention. People need closure or grounding for non-linguistic actions as well. For example, why does a well-designed elevator button light up when it’s pressed? Because this indicates to the elevator traveler that she has successfully called the elevator. Clark (1996) phrases this need for closure as follows, after Norman (1988):
Principle of closure. Agents performing an action require evidence, sufficient for current purposes, that they have succeeded in performing it.

Grounding is also important when the hearer needs to indicate that the speaker has not succeeded. If the hearer has problems in understanding, she must indicate these problems to the speaker, again so that mutual understanding can eventually be achieved.

Clark and Schaefer (1989) point out a continuum of methods the hearer B can use to ground the speaker A’s utterance, ordered from weakest to strongest:

Continued attention: B shows she is continuing to attend and therefore remains satisfied with A’s presentation.

Next contribution: B starts in on the next relevant contribution.

Acknowledgment: B nods or says a continuer like uh-huh, yeah, or the like, or an assessment like that’s great.

Demonstration: B demonstrates all or part of what she has understood A to mean, for example, by reformulating (paraphrasing) A’s utterance or by collaborative completion of A’s utterance.

Display: B displays verbatim all or part of A’s presentation.

The word uh-huh is a backchannel, also called a continuer or an acknowledgment token. A backchannel is a (short) optional utterance that acknowledges the content of the utterance of the other and that doesn’t require an acknowledgment by the other (Yngve 1970, Jefferson 1984, Schegloff 1982, Ward and Tsukahara 2000).

Another grounding method is to start in on the relevant next contribution, and in a more subtle act of grounding, a speaker can combine this method with the previous one. For example, in the book's sample travel-agent dialog, whenever the client answers a question, the agent begins the next question with And. The And indicates to the client that the agent has successfully understood the answer to the last question.

Speech acts are important for practical dialog systems, which need to distinguish a statement from a directive, and which must distinguish (among the many kinds of directives) an order to do something from a question asking for information. Grounding is also crucial in dialog systems. Consider the unnaturalness of this example from Cohen et al. (2004):

(25.1) System: Did you want to review some more of your personal profile?
Caller: No.
System: What’s next?

Without an acknowledgment, the caller doesn’t know that the system has understood her ‘No’. The use of Okay below adds grounding, making (25.2) a much more natural response than (25.1):

(25.2) System: Did you want to review some more of your personal profile?
Caller: No.
System: Okay, what’s next?

The ideas of speech acts and grounding are combined in a single kind of action called a dialog act, a tag which represents the interactive function of the sentence being tagged. Different types of dialog systems require labeling different kinds of acts, and so the tagset—defining what a dialog act is exactly— tends to be designed for particular tasks.
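For instance, a task-specific inventory might look like the sketch below. This particular tagset is hypothetical, loosely modeled on the kinds of acts used in travel-booking systems, not a tagset from the book:

```python
# A hypothetical dialog-act tagset for a flight-booking domain.
# Real systems define their own inventories; this is only illustrative.
DIALOG_ACTS = {
    "INFORM":      "user or system supplies a slot value ('to Boston')",
    "REQUEST":     "ask for the value of a slot ('what date?')",
    "YN_QUESTION": "yes-no question ('do you have anything cheaper?')",
    "CONFIRM":     "system checks a value ('you said Boston?')",
    "AFFIRM":      "positive answer ('yes')",
    "NEGATE":      "negative answer or correction ('no, Austin')",
    "ACK":         "grounding acknowledgment ('okay', 'uh-huh')",
    "BYE":         "close the conversation",
}
```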

25.2 Dialog State: Interpreting Dialog Acts

The job of the dialog-state tracker is to determine both the current state of the frame (the fillers of each slot) and the user's most recent dialog act. Note that the dialog state includes more than just the slot-fillers expressed in the current sentence; it includes the entire state of the frame at this point, summarizing all of the user's constraints.

A YES-NO QUESTION or even a STATEMENT may actually be a polite form of a REQUEST. Utterances that use a surface statement to ask a question or a surface question to issue a request are called indirect speech acts.

25.2.1 Sketching an algorithm for dialog act interpretation

Since the dialog act places some constraints on the slots and their values, the tasks of dialog-act detection and slot-filling are often performed jointly.

The joint dialog act interpretation/slot filling algorithm generally begins with a first-pass classifier that decides on the dialog act for the sentence.

A second-pass classifier might use the sequence-model algorithms for slot-filler extraction from Section 24.2.2, such as LSTM-based IOB tagging, CRFs, or a joint LSTM-CRF. Alternatively, a multinomial classifier can be used to choose among all possible slot-value pairs, again either neural (such as a bi-LSTM or convolutional net) or feature-based, using any of the feature functions defined in Chapter 24. This is possible because the domain ontology for the system is fixed, so there is a finite number of slot-value pairs.
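A minimal sketch of such a joint model, assuming PyTorch (the layer sizes and the two label inventories are my own): a shared bi-LSTM encoder feeds both a per-token IOB slot tagger and an utterance-level dialog-act classifier.

```python
import torch
import torch.nn as nn

class JointActSlotModel(nn.Module):
    """Shared bi-LSTM encoder with two heads: utterance-level dialog-act
    classification and per-token IOB slot tagging (a common joint setup)."""
    def __init__(self, vocab_size, n_acts, n_iob_tags, emb=100, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.act_head = nn.Linear(2 * hidden, n_acts)        # dialog act (first pass)
        self.slot_head = nn.Linear(2 * hidden, n_iob_tags)   # IOB tag per token

    def forward(self, token_ids):
        states, _ = self.encoder(self.embed(token_ids))      # (B, T, 2H)
        act_logits = self.act_head(states.mean(dim=1))       # pool over tokens
        slot_logits = self.slot_head(states)                 # one tag per token
        return act_logits, slot_logits

# Toy usage: a batch of one 5-token utterance.
model = JointActSlotModel(vocab_size=1000, n_acts=8, n_iob_tags=9)
acts, slots = model(torch.randint(0, 1000, (1, 5)))
print(acts.shape, slots.shape)  # torch.Size([1, 8]) torch.Size([1, 5, 9])
```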

25.2.2 A special case: detecting correction acts

Some standard features used for detecting correction acts are shown below:

[Figure: standard features for detecting correction acts (image 25.2 example.png failed to load)]
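As a rough, hypothetical illustration of how such features might be computed (the marker list, thresholds, and function below are my own, not the book's feature set), a detector can combine lexical cues, repetition of earlier material, and low ASR confidence:

```python
# Hypothetical features for deciding whether a user turn is a correction.
# Representative cues: lexical markers, overlap with the prior turn
# (users often repeat the corrected material), and low ASR confidence.
CORRECTION_MARKERS = {"no", "not", "wrong", "meant", "instead"}

def correction_features(turn, prev_turn, asr_confidence):
    words = set(turn.lower().split())
    prev_words = set(prev_turn.lower().split())
    return {
        "has_marker": bool(words & CORRECTION_MARKERS),
        "overlap": len(words & prev_words) / max(len(words), 1),  # repetition cue
        "low_asr_conf": asr_confidence < 0.5,   # threshold is illustrative
        "short_turn": len(words) <= 3,          # corrections are often terse
    }

print(correction_features("no I meant Boston", "flying to Austin", 0.42))
```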

25.3 Dialog Policy

The goal of the dialog policy is to decide what action the system should take next, that is, what dialog act to generate.

25.3.1 Generating Dialog Acts: Confirmation and Rejection

Confirmation strategies: explicit confirmation, implicit confirmation.

Rejection strategies: progressive prompting (escalating detail), rapid reprompting.
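One common way to choose among these strategies is to threshold the ASR confidence of the recognized slot value: reject at very low confidence, confirm explicitly at low confidence, confirm implicitly at higher confidence, and skip confirmation when confidence is very high. The sketch below follows that idea; the threshold values are arbitrary placeholders:

```python
# A confidence-thresholded confirmation policy (threshold values are arbitrary).
def confirmation_action(slot, value, confidence, low=0.3, mid=0.6, high=0.9):
    if confidence < low:
        return ("REJECT", slot)                   # reprompt: "I'm sorry, ..."
    elif confidence < mid:
        return ("CONFIRM_EXPLICIT", slot, value)  # "You said Boston?"
    elif confidence < high:
        return ("CONFIRM_IMPLICIT", slot, value)  # "When do you leave Boston?"
    return ("ACCEPT", slot, value)                # confident enough: no confirmation

print(confirmation_action("destination", "boston", 0.75))
# -> ('CONFIRM_IMPLICIT', 'destination', 'boston')
```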

25.4 A simple policy based on local context

The goal of the dialog policy at turn $i$ in the conversation is to predict which action $A_i$ to take, based on the entire dialog state. The state could mean the entire sequence of dialog acts from the system ($A$) and from the user ($U$), in which case the task would be to compute:

$$\hat{A}_{i}=\underset{A_{i} \in A}{\operatorname{argmax}}\; P\left(A_{i} \mid A_{1}, U_{1}, \ldots, A_{i-1}, U_{i-1}\right)$$

Estimating this from the full history is impractical, so a policy might instead condition just on the current state of the frame, Frame$_{i-1}$ (which slots are filled and with what), and the last turn by the system and user:

$$\hat{A}_{i}=\underset{A_{i} \in A}{\operatorname{argmax}}\; P\left(A_{i} \mid \text{Frame}_{i-1}, A_{i-1}, U_{i-1}\right)$$
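In code, this policy is just an argmax over a finite act inventory, scored by some model of $P(A_i \mid \text{Frame}_{i-1}, A_{i-1}, U_{i-1})$. In the sketch below, the act inventory is invented and `score` is a toy heuristic standing in for a trained classifier:

```python
# Argmax dialog policy over a small hypothetical act inventory. `score` stands
# in for a trained model of P(A_i | Frame_{i-1}, A_{i-1}, U_{i-1}).
ACTS = ["REQUEST(date)", "REQUEST(destination)", "CONFIRM", "BOOK"]

def score(act, frame, last_sys_act, last_user_act):
    # Toy heuristic scorer: prefer asking for missing slots, then confirming.
    if act.startswith("REQUEST"):
        slot = act[8:-1]
        return 1.0 if slot not in frame else 0.0
    if act == "CONFIRM":
        return 0.5 if frame else 0.0
    return 0.9 if len(frame) >= 2 else 0.1        # BOOK once the frame is full

def policy(frame, last_sys_act, last_user_act):
    return max(ACTS, key=lambda a: score(a, frame, last_sys_act, last_user_act))

print(policy({"destination": "boston"}, "REQUEST(destination)", "INFORM"))
# -> 'REQUEST(date)'
```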

25.5 Natural language generation in the dialog-state model

The task of natural language generation (NLG) in the information-state architecture is often modeled in two stages, content planning (what to say), and sentence realization (how to say it).
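A minimal sketch of the two stages, assuming a template-based realizer (the act names and templates are invented for illustration): content planning picks a dialog act plus slot-value pairs, and sentence realization fills a delexicalized template with those values.

```python
# Two-stage NLG sketch: content planning chooses what to say (an act plus
# slot-value pairs); sentence realization fills a delexicalized template.
TEMPLATES = {
    "CONFIRM": "So, a flight to {destination} on {date}?",
    "REQUEST": "What {slot} would you like?",
}

def content_plan(frame):
    # What to say: request the first missing slot, else confirm the frame.
    for slot in ("destination", "date"):
        if slot not in frame:
            return ("REQUEST", {"slot": slot})
    return ("CONFIRM", frame)

def realize(act, args):
    # How to say it: fill the template for this act with the planned values.
    return TEMPLATES[act].format(**args)

print(realize(*content_plan({"destination": "Boston"})))   # "What date would you like?"
print(realize(*content_plan({"destination": "Boston", "date": "Tuesday"})))
```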

25.6 Deep Reinforcement Learning for Dialog

TBD

25.7 Summary

• In dialog, speaking is a kind of action; these acts are referred to as speech acts. Speakers also attempt to achieve common ground by acknowledging that they have understood each other. The dialog act combines the intuitions of speech acts and grounding acts.

• The dialog-state or information-state architecture augments the frame-and-slot state architecture by keeping track of the user's dialog acts, and includes a policy for generating the system's own dialog acts in return.

• Policies based on reinforcement learning architectures like MDPs and POMDPs offer ways for future dialog reward to be propagated back to influence the policy earlier in the dialog.
