Chapter 3
CoffeeS1
Introduction
With the basic structure for speech recognition firmly in place, CoffeeS1 expands word recognition capabilities. You can navigate around the coffee shop, and you can place orders for different kinds of coffee drinks.
The example focuses on grammars generally and on command and control grammar specifically. The following topics will be discussed:
• Grammars: Command and control, dictation.
• Phrase: Phrase structure, word recognition.
• Grammar Files: XML tagging.
Grammar Types
Command and Control
The last example (CoffeeS0) was not robust. You were limited to about five words. However, those five words are of special interest: using them, you could move around the application. Using a grammar in the command and control function limits the words available and bestows specialized meanings upon them. This is convenient for some uses in applications. In the last chapter, you learned that grammars could have limited recognition contexts, such as for menus. As an example, you would want the application to respond, or at least attempt to respond, to certain words relating to menu items or to the menu bar, such as "file," "open," or "print."
These words are provided in an exclusive list. If the word is not found, it is not recognized. Also, word order matters in some cases. In CoffeeS0, "Go to counter" was understood and "Counter go to" was not. Using a list approach is also called rule-based or context-free grammar. Words are evaluated according to a fixed set of rules. In short, the word is either in the list or it is not. There is no attempt to figure out the intent of the word based on the words that came before or after it. That is, there is no context for the words.
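The all-or-nothing behavior described above can be illustrated with a short sketch. This is not how SAPI is implemented; the phrase list and the IsRecognized() helper are hypothetical and exist only to show that an exact, ordered match succeeds while a reordered phrase fails.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical command list. The real CoffeeS0 grammar lives in its XML file.
static const std::vector<std::string> kCommands = {
    "go to counter", "enter shop", "enter store"
};

// Returns true only when the spoken phrase matches a listed phrase exactly,
// word for word and in the same order. No context is consulted.
bool IsRecognized(const std::string& spoken)
{
    for (const std::string& command : kCommands)
    {
        if (spoken == command)
            return true;
    }
    return false;
}
```

With this list, IsRecognized("go to counter") succeeds, while IsRecognized("counter go to") does not, even though both contain the same words.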
SAPI 5 uses extensible markup language (XML) to create this list. The file may be generated ahead of time or compiled during program execution. Because command and control deals mostly with lists, words can be added dynamically and can accommodate new situations easily.
CoffeeS1 addresses word order and the ability to sequence words.
Dictation
Command and control has obvious shortcomings. As mentioned, it is limited in the words used. Someone has to spend the time to manually define the command set. Often, you will want to speak any word and have it recognized. This is what a traditional speech recognition (SR) program does. That is, you can dictate any word, no matter how esoteric, into a word processor and have that word translated into text. For this use of a speech recognition engine, you must move from command and control to a dictation grammar. Instead of an XML-based vocabulary, dictation grammar uses a much more extensive range of words and determines each word based on context. The words immediately before and after it are studied and dictation grammar chooses the most likely outcome. For this reason, this is also called a statistical language model (SLM).
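To make the idea of choosing a word from context concrete, here is a toy sketch. It assumes a table of bigram counts (how often one word follows another) and picks the candidate that most often follows the previous word. The type BigramCounts, the function PickByContext(), and any counts fed to it are hypothetical; a real dictation engine is far more elaborate.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model: counts of (previous word, next word) pairs.
using BigramCounts = std::map<std::pair<std::string, std::string>, int>;

// Returns the candidate that most often follows `previous` in the counts.
// Unseen pairs count as zero; ties keep the earliest candidate.
std::string PickByContext(const BigramCounts& counts,
                          const std::string& previous,
                          const std::vector<std::string>& candidates)
{
    std::string best;
    int bestCount = -1;
    for (const std::string& candidate : candidates)
    {
        const auto it = counts.find(std::make_pair(previous, candidate));
        const int count = (it != counts.end()) ? it->second : 0;
        if (count > bestCount)
        {
            bestCount = count;
            best = candidate;
        }
    }
    return best;
}
```

Given counts in which "to" follows "go" far more often than "two" or "too" does, the function resolves the three sound-alike candidates to "to". A command and control grammar never makes this kind of judgment.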
SR engines offer a wide range of vocabularies. The Microsoft SR engine included with SAPI 5 has a vocabulary of 60,000 English words and is an adequate engine for most people. Other engines are specialized, for example, for the legal and medical professions. These can be massive databases generated by commercial firms. In addition, engines for other languages, including Japanese, Chinese, German, and Russian, are also available.
As widely disparate as these languages and usages seem, SAPI 5 handles them in the same way. The programming approach is very similar. Two other samples provide a dictation approach to speech recognition: Simple Dictation and Dictation Pad. These may be found in the SAPI 5.1 SDK and are documented separately. Coffee, on the other hand, limits itself solely to command and control usage.
Phrases
SAPI returns the actual recognized words through a series of structures collectively called phrases. You have seen evidence of this with the SPEI_PHRASE_START event, which indicates the start of the recognition process. For command and control uses, it is a two-step process: determine the activated rule, and then inspect the elements (or words) within that phrase.
CoffeeS0 briefly introduced the first step. While processing a recognition event, you discovered which rule was activated but stopped there. CoffeeS1 takes the next logical step to recognize the exact words used so patrons can get their drinks.
This examination takes place in the ExecuteCommand() routine. As in CoffeeS0, one of the parameters is the phrase. Remember, this phrase is the final result rather than a hypothesis, so assume SAPI has translated exactly what you said. You will be depending on this phrase for navigation. At this point, you are only interested in which grammar rule was activated. That means patrons still cannot go to different places in the shop even though they might ask to do so. Every navigation statement leads to the counter.
However, CoffeeS1 introduces a new grammar rule: VID_EspressoDrinks. Defined in coffee.xml, this rule lists all drinks available to the customers. Actually, it is several rules bound together, which will be discussed later. Again, you are only concerned with the top-level rule, VID_EspressoDrinks. If you place an order that matches this rule, the rule activates and the result is passed back from SAPI. In typical demanding coffee shop fashion, this could be "Get me an iced decaf single tall peppermint whole espresso." Orders could even include "single triple short tall grande" and still be valid SAPI grammar, although it might raise an eyebrow (if that were possible).
With the order placed and recognition successful, CoffeeS1 now gets to the task of dissecting the phrase.
From the original phrase, the ISpPhrase interface provides the GetPhrase() method to construct the elements (or words) list.
SPPHRASE *pElements;
if (SUCCEEDED(pPhrase->GetPhrase(&pElements)))
If successful, pElements contains all the information required to construct the sentence. To determine which rule activated and then to learn more about it, look at the member Rule. This is a structure (SPPHRASERULE) that fully describes the rule. The rule ID is found in its member ulId. CoffeeS1 defines the rule VID_EspressoDrinks numerically in the XML file, so matching becomes easy. Use a simple switch statement in the code to select the more specific handling routine.
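The dispatch on the rule ID can be sketched without any SAPI headers. In the sketch below, MockPhraseRule and MockPhrase are simplified stand-ins for SPPHRASERULE and SPPHRASE that keep only the Rule.ulId path discussed above. The numeric values of the rule IDs and the DispatchOnRule() function are assumptions made for illustration; the real values are assigned in coffee.xml.

```cpp
#include <cassert>
#include <string>

// Hypothetical rule IDs. The real values are assigned in coffee.xml.
enum : unsigned long { VID_EspressoDrinks = 1, VID_Navigation = 2 };

// Simplified stand-ins for SPPHRASERULE and SPPHRASE.
struct MockPhraseRule { unsigned long ulId; };
struct MockPhrase     { MockPhraseRule Rule; };

// Selects a handler name from the ID of the activated top-level rule.
std::string DispatchOnRule(const MockPhrase& phrase)
{
    switch (phrase.Rule.ulId)
    {
    case VID_EspressoDrinks:
        return "order";      // dissect the drink order
    case VID_Navigation:
        return "navigate";   // move around the shop
    default:
        return "ignored";    // not a rule this application handles
    }
}
```

Because the IDs are plain numbers shared between the grammar file and the code, the switch needs no string comparison at all.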
Two things need to be pointed out about the upcoming word list. First, the words are represented numerically rather than by a string. Associating the value of the word to the string itself uses a look-up table. In this case, CoffeeS1 stores the words as a resource in the application.
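A look-up from numeric value to display string might be sketched as follows. CoffeeS1 keeps its strings in the application's resources; the std::map here is only a portable stand-in for that table, and the IDs, words, and function names are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical ID-to-word table standing in for the application's
// string resources.
const std::map<unsigned long, std::string>& WordTable()
{
    static const std::map<unsigned long, std::string> table = {
        {1, "tall"}, {2, "two percent"}, {3, "mocha"}
    };
    return table;
}

// Returns the word for an ID, or an empty string when the ID is unknown.
std::string WordForId(unsigned long id)
{
    const auto& table = WordTable();
    const auto it = table.find(id);
    return (it != table.end()) ? it->second : std::string();
}
```

Returning an empty string for an unknown ID is one possible design choice; it lets the caller skip entries it cannot display.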
Second, the actual words are formed by a linked list, with each word represented by a member in the sequence. The first element is a structure (of type SPPHRASEPROPERTY) pointed to by pElements->pProperties, and each subsequent structure is reached through the SPPHRASEPROPERTY's pNextSibling member. Traveling this chain is a standard linked list operation.
case VID_EspressoDrinks:
{
    // This memory will be freed when the WM_ESPRESSOORDER
    // message is processed.
    ULONG *pulIds = new ULONG[MAX_ID_ARRAY];
    const SPPHRASEPROPERTY *pProp = NULL;
    int iCnt = 0;
    if ( pulIds )
    {
        ZeroMemory( pulIds, MAX_ID_ARRAY * sizeof(ULONG) );
        pProp = pElements->pProperties;
        // Fill in an array with the drink properties received
        while ( pProp && iCnt < MAX_ID_ARRAY )
        {
            pulIds[iCnt] = static_cast< ULONG >(pProp->vValue.ulVal);
            pProp = pProp->pNextSibling;
            iCnt++;
        }
        // Hand the array over to the window that handles the order
        PostMessage( hWnd, WM_ESPRESSOORDER, NULL, (LPARAM) pulIds );
    }
}
break;
To inspect the elements, the code steps through the list one node at a time until the next node is NULL (meaning there are no more nodes to traverse) or until it has visited MAX_ID_ARRAY nodes. CoffeeS1 imposes this MAX_ID_ARRAY limitation.
Besides stepping through the linked list, this code also stores the words in an internal array for later processing. This not only keeps a record of the words but also helps with sorting. Remember, you do not need to worry about word order. Customers can say "get me a mocha two percent tall" and still end up with a tall two percent mocha. However, if you do change the word order, then you need the ability to sort internally. To indicate empty array elements, flag them with a zero, hence the Win32 ZeroMemory() call. You can use other methods; this one was just convenient for this example.
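The traversal and the later sort can be tried outside SAPI with a small stand-in. In this sketch, MockProperty mimics only the value and pNextSibling fields of SPPHRASEPROPERTY, the value chosen for MAX_ID_ARRAY is an assumption, and CollectSortedIds() is a hypothetical helper. A std::vector replaces the zero-filled array, so no empty-slot flag is needed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed limit for this sketch. CoffeeS1 defines its own MAX_ID_ARRAY.
const std::size_t MAX_ID_ARRAY = 8;

// Stand-in for SPPHRASEPROPERTY with only the two fields used here.
struct MockProperty
{
    unsigned long       ulVal;         // numeric value of the word
    const MockProperty* pNextSibling;  // next word, or nullptr at the end
};

// Walks the sibling chain, copies at most MAX_ID_ARRAY values, and
// returns them sorted in ascending order.
std::vector<unsigned long> CollectSortedIds(const MockProperty* pProp)
{
    std::vector<unsigned long> ids;
    while (pProp != nullptr && ids.size() < MAX_ID_ARRAY)
    {
        ids.push_back(pProp->ulVal);
        pProp = pProp->pNextSibling;
    }
    std::sort(ids.begin(), ids.end());
    return ids;
}
```

If the words of "mocha two percent tall" carry values that increase in the order size, milk, drink, sorting the collected values restores the conventional order regardless of how the customer phrased the request.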
After going through the list, CoffeeS1 is ready to display the newly derived information. It posts a message (WM_ESPRESSOORDER) to the owning window to indicate that the application has additional processing to do. At this point, SAPI is no longer involved. SAPI frees the objects it created, although CoffeeS1 must free pElements itself because it requested that structure explicitly. Even so, COM releases the nodes of the linked list along with the structure, so CoffeeS1 does not free them one by one. The rest of the processing is on CoffeeS1's part and mostly updates the screen. When you speak again, the whole process above is repeated.
Grammar Files
As mentioned, the Coffee examples use command and control grammar. This is a discrete list of words associated with certain rules. Coffee keeps this list in two forms. An XML-based file allows you to maintain the list. Ultimately, however, SAPI can only read a binary, or compiled, version of that file. This is a grammar configuration that is saved with the .cfg file suffix. It was by clever design that CFG means not only "configuration" but also "context-free grammar."
Grammar files may be compiled dynamically during the application's run time. If it is provided with only an XML file, SAPI compiles the file automatically and uses the resulting grammar. Alternatively, the programming team may compile the grammar ahead of time. This restricts access to the vocabulary so users cannot change grammars unexpectedly. It is also faster for applications, since no compile time is required during operation. The SAPI SDK provides a grammar compiler called GramComp. Grammar compilation using this tool is documented separately.
SAPI defines the XML tags and their uses and lists them in Reference API. For a more complete discussion, see Text grammar format. As a brief overview of the structure, look at coffee.xml in the CoffeeS1 project. There are several rules defined but only two are considered top level: VID_EspressoDrinks and VID_Navigation. These are the significant rules for SAPI. When a rule match is made, it is one of these IDs that is passed back to the application. Also, look at ExecuteCommand(). The two case statements coincide with the top-level rule names.
The TOPLEVEL tag within the RULE statement gives these rules their special status. Not only does this identify the rule as being top level, but it also sets the activation state. Only top-level rules may be activated or deactivated. SAPI recognizes active rules and conversely does not recognize deactivated ones. The application may change the state of the rules during execution. If a rule is no longer needed, it may be deactivated. This allows you to turn rules on and off based on the current recognition context. For example, if you have a menu or menu item deactivated, SAPI will not need to attempt to recognize the words associated with it. When the menu is active again, the rule will likewise be activated.
The words or phrases are listed inside the rule. They may be optional or required. As the name implies, optional words are not required for a successful rule match. SAPI adds them as a convenience to the speaker. "Please enter the shop" is natural and pleasant sounding, as opposed to the demanding version of the statement. Required words are, of course, required. However, you can present an alternative word list from which any one word can be used to complete the match. In the case of VID_Navigation, you can say either "enter" or "go to," but not both.
In the same manner, you may reference other rules but not other top-level rules. Continuing the VID_Navigation example, the last portion of the requirement is that the rule VID_Place must be successfully matched. The three alternatives, "counter," "shop," and "store," are defined as VID_Place. If you say one of these three words, the rule is successfully matched. Upon successful completion of all the requirements, the top-level rule VID_Navigation matches, and an SPEI_RECOGNITION event passes back to the application.
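The way optional words, alternatives, and a referenced rule combine can be sketched as a small hand-written matcher. This only imitates the behavior described above and is not how SAPI evaluates a grammar. The exact phrase lists in coffee.xml may differ: the optional words "please" and "the" are assumptions drawn from the example sentences in the text, input is assumed to be lowercase, and all function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Splits a phrase into words separated by whitespace.
std::vector<std::string> SplitWords(const std::string& phrase)
{
    std::istringstream stream(phrase);
    std::vector<std::string> words;
    std::string word;
    while (stream >> word)
        words.push_back(word);
    return words;
}

// Referenced rule (VID_Place in the text): one of three alternatives.
bool MatchesPlace(const std::string& word)
{
    return word == "counter" || word == "shop" || word == "store";
}

// Top-level rule sketch: [please] (enter | go to) [the] <place>
bool MatchesNavigation(const std::string& phrase)
{
    const std::vector<std::string> words = SplitWords(phrase);
    std::size_t i = 0;

    // Optional word: skipped when present, not required for a match.
    if (i < words.size() && words[i] == "please")
        ++i;

    // Required alternative: exactly one of "enter" or "go to".
    if (i < words.size() && words[i] == "enter")
        i += 1;
    else if (i + 1 < words.size() && words[i] == "go" && words[i + 1] == "to")
        i += 2;
    else
        return false;

    // Optional article before the place.
    if (i < words.size() && words[i] == "the")
        ++i;

    // Required rule reference; it must be the last word of the phrase.
    return i + 1 == words.size() && MatchesPlace(words[i]);
}
```

Under these assumptions, "please enter the shop" and "go to counter" both match, while "counter go to" and "enter go to shop" do not.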
Additional study of coffee.xml helps you understand how complex rules are constructed. The other rules have basically the same format and follow the same structure. Look up unfamiliar tags in the "Text grammar format" section of the reference API. Curiously enough, the program is case sensitive. "The" and "the" may be duplicated as entries. They may even have the same ID, such as <ID NAME="The" VAL="1" /> and <ID NAME="the" VAL="1" />. While this pair has the same pronunciation, consider other words such as "Polish" and "polish." This applies equally to rule names. There is no requirement for engines to recognize the words as different; however, engine vendors may want to do so. Because the words are case sensitive, newer engines can take advantage of these differences.
The first portion of the file assigns numeric values to the individual elements. SAPI does not require this, but in the CoffeeS1 example the values let you sort the words. The sorting is based on the "VAL=" attribute. Remember to keep the words actually found in the array pulIds for this purpose.
Activating the rules is the same as in CoffeeS0.