Title of Invention

" A METHOD FOR SELECTIVE DISTRIBUTED SPEECH RECOGNITION AND A WIRELESS DEVICE THEREFOR"

Abstract

An apparatus and method for selective distributed speech recognition includes a dialog manager (104) that is capable of receiving a grammar type indicator (170). The dialog manager (104) is capable of being coupled to an external speech recognition engine (108), which may be disposed on a communication network (162). The apparatus and method further includes an audio receiver (102) coupled to the dialog manager (104), wherein the audio receiver (102) receives a speech input (110) and provides an encoded audio input (112) to the dialog manager (104). The method and apparatus also includes an embedded speech recognition engine (106) coupled to the dialog manager (104), such that the dialog manager (104) selects to distribute the encoded audio input (112) to either the embedded speech recognition engine (106) or the external speech recognition engine (108) based on the corresponding grammar type indicator (170).
Full Text
METHOD AND APPARATUS FOR SELECTIVE
DISTRIBUTED SPEECH RECOGNITION
BACKGROUND OF THE INVENTION
[0001] The invention relates generally to speech recognition, and more
specifically, to distributed speech recognition between a wireless device and a
communication server.
[0002] With the growth of speech recognition capabilities, there is a
corresponding increase in the number of applications and uses for speech recognition.
Different types of speech recognition applications and systems have been developed,
based upon the location of the speech recognition engine with respect to the user.
One such example is an embedded speech recognition engine, otherwise known as a
local speech recognition engine, such as a Speech2Go speech recognition engine sold
by Speech Works International, Inc., 695 Atlantic Avenue, Boston, MA 02111.
Another type of speech recognition engine is a network-based speech recognition
engine, such as Speech Works 6, as sold by Speech Works International, Inc., 695
Atlantic Avenue, Boston, MA 02111.
[0003] Embedded, or local, speech recognition engines provide the added
benefit of reduced latency in recognizing a speech input, wherein a speech input
includes any type of audible or audio-based input. One of the drawbacks of
embedded or local speech recognition engines is that these engines contain a limited
vocabulary. Due to memory limitations and system processing requirements, in
conjunction with power consumption limitations, embedded or local speech
recognition engines are limited to providing recognition to only a fraction of the
speech inputs which would be recognizable by a network-based speech recognition
engine.
[0004] Network-based speech recognition engines provide the added benefit
of an increased vocabulary, based on the elimination of memory and processing
restrictions. A downside, however, is the added latency between when a user provides
a speech input and when the speech input may be recognized and
provided back to the end user for confirmation of recognition. Other disadvantages
include the requirement for continuous availability of the communication path, the
resulting increased server load, and the cost to the user of connection and service. In
a typical speech recognition system, the user provides the speech input and the speech
input is thereupon provided to a server across a communication path, whereupon it
may then be recognized. Extra latency is incurred in not only transmitting the speech
input to the network-based speech recognition engine, but also transmitting the
recognized speech input, or an N-best list, back to the end user.
[0005] One proposed solution to overcoming the inherent limitations of
embedded speech recognition engines and the latency problems associated with
network-based speech recognition engines is to preliminarily attempt to recognize all
speech inputs with the embedded speech recognition engine. Thereupon, a
determination is made whether the local speech recognition engine has properly recognized
the speech input, based upon, among other things, a recognition confidence level. If it
is determined that the speech input has not been recognized by the local speech
recognition engine, such that a confidence level is below a threshold value, the speech
input is thereupon provided to a network-based speech recognition engine. This
solution, while eliminating latency issues with respect to speech inputs that are
recognized by the embedded speech recognition engine, adds an extra latency step for
all other inputs by first attempting to recognize the speech input locally. Therefore,
when the speech inputs must be recognized using the network-based speech
recognition engine, the user is required to incur a further delay.
[0006] Another proposed solution to overcoming the limitations of embedded
speech recognition engines and network-based speech recognition engines is to
attempt to recognize the speech input both at the local level, using the embedded
speech recognition engine, and at the server level, using the network-based speech
recognition engine. Thereupon, both recognized speech inputs are then compared and
the user is provided with a best-guess at the recognized inputs. Once again, this
solution requires the usage of the network-based speech recognition engine, which
may add extra latency if the speech input is recognizable by the embedded speech
recognition engine.
BRIEF DESCRIPTION OF THE DRAWINGS
[0007] The invention will be more readily understood with reference to the
following drawings wherein:
[0008] FIG. 1 illustrates one example of an apparatus for distributed speech
recognition;
[0009] FIG. 2 illustrates one example of a method for distributed speech
recognition;
[0010] FIG. 3 illustrates another example of the apparatus for distributed
speech recognition;
[0011] FIG. 4 illustrates an example of a plurality of grammar type indicators;
[0012] FIG. 5 illustrates another example of a method for distributed speech
recognition;
[0013] FIG. 6 illustrates an example of a method of an application utilizing
distributed speech recognition; and
[0014] FIG. 7 illustrates an example of an embodiment of a method for
distributed speech recognition.
DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT
[0015] Briefly, a method and apparatus for selective distributed speech
recognition includes receiving a plurality of grammar type indicators, wherein a
grammar type indicator is a class of speech recognition patterns associated with a
plurality of grammar class entries. The grammar class entries are elements within the
class that is defined by the grammar class. For example, a grammar type indicator
may be 'DAYS OF THE WEEK,' containing the grammar type indicator entries of
Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday, yesterday and
tomorrow. The grammar type indicator furthermore includes an address to the
grammar class stored within a speech recognition engine, or may include the grammar class
itself, consisting of a tagged list of the grammar class entries, or may include a
Universal Resource Identifier (URI) that points to a resource on the network where
the grammar class is available. In another embodiment, the grammar type indicator
may include a pointer to a specific speech recognition engine having the grammar
class therein.
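By way of illustration only, and not as part of the original disclosure, the following Python sketch shows one plausible representation of such a grammar type indicator; all class and field names are hypothetical.

from dataclasses import dataclass
from typing import Optional

@dataclass
class GrammarTypeIndicator:
    # Hypothetical container: typically exactly one of the locating
    # fields is populated -- an inline tagged list of grammar class
    # entries, a URI to a network-hosted grammar class, or a pointer
    # naming the speech recognition engine that holds the grammar.
    name: str
    entries: Optional[list[str]] = None      # inline grammar class entries
    grammar_uri: Optional[str] = None        # URI to a hosted grammar class
    engine_pointer: Optional[str] = None     # engine holding the grammar

days_of_week = GrammarTypeIndicator(
    name="DAYS_OF_THE_WEEK",
    entries=["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday", "yesterday", "tomorrow"],
)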
[0016] The method for selective distributed speech recognition further
includes receiving a speech input that corresponds to one of the grammar class entries.
As discussed above, a speech input is any type of audio or audible input, typically
provided by an end user, that is to be recognized using a speech recognition engine
and an action is thereupon to be performed in response to the recognized speech input.
The method and apparatus further limit recognition to either an embedded speech
recognition engine or an external speech recognition engine, based on the grammar
type indicator. In one embodiment, the embedded speech recognition engine is
embedded within the apparatus for distributed speech recognition, also
referred to as a local speech recognition engine, as discussed above, and the external
speech recognition engine may be a network-based speech recognition engine, also as
discussed above.
[0017] Thereupon, the method and apparatus selectively distribute the speech
input to either the embedded speech recognition engine or the external speech
recognition engine, such as the network-based speech recognition engine, based on
the specific grammar type indicator. More specifically, the speech input is encoded
into an encoded audio input and the encoded audio input, which represents an
encoding of the speech input, is provided to the selected speech recognition engine.
Furthermore, the speech input is expected to correspond to one of the grammar class
entries for the specific grammar type indicator.
[0018] FIG. 1 illustrates a wireless device 100 that includes an audio receiver
102, a dialog manager 104, such as a multi-modal browser or a voice browser, and a
first speech recognition engine 106, such as an embedded speech recognition engine.
The wireless device 100 may be any device capable of receiving communication from
a wireless or non-wireless device or network, a server or other communication
network. The wireless device 100 includes, but is not limited to, a client device such
as a cellular phone, a laptop computer, a desktop computer, a pager, a smart phone, or
other wireless devices such as a personal digital assistant, or any other suitable device
capable of receiving communication as recognized by one having ordinary skill in the
art. The dialog manager 104, which may be a multi-modal browser capable of
reading and outputting mark-up language for multiple modes, such as, but not limited
to, graphic and voice mode, is operably coupleable to a second speech recognition
engine 108, such as an external speech recognition engine, which may be a network-
based speech recognition engine.
[0019] In one embodiment, the dialog manager 104 is operably coupleable to
the second speech recognition engine 108 through a communication network, not
shown. Furthermore, the second speech recognition engine 108 may be disposed on a
communication server, not shown, wherein a communication server includes any type
of server in communication with the communication network, such as communication
through an internet, an intranet, a proprietary server, or any other recognized
communication path for providing communication between the wireless device 100
and the communication server, as illustrated below in FIG. 3.
[0020] The audio receiver 102 receives a speech input 110, such as provided
from an end user. The audio receiver 102 receives the speech input 110, encodes the
speech input 110 to generate an encoded audio input 112 and provides the encoded
audio input 112 to the dialog manager 104. The dialog manager 104 receives a
plurality of grammar type indicators 114. As discussed below, the grammar type
indicators may be provided across the communication network (not shown), from one
or more local processors executing a local application disposed within the
communication device, or may be provided from any other suitable location as
recognized by one having ordinary skill in the art.
[0021] The dialog manager 104 receives the encoded audio input 112 from the
audio receiver 102 and, based on the grammar type indicators 114, selects either the
first speech recognition engine 106 or the second speech recognition engine 108 to
recognize the encoded audio input 112. As discussed below, the grammar type
indicators contain indicators as to which speech recognition engine should be utilized
to recognize a speech input, based on the complexity of the expected speech input and
the abilities and/or limitations of the first speech recognition engine 106. When the
encoded audio input 116 is thereupon provided to the first speech recognition engine
106 disposed within the wireless device 100, the speech recognition is performed
within the wireless device 100. When the encoded audio input 118 is provided to the
second speech recognition engine 108, the encoded audio input 118 is transmitted
across a communication interface, not shown, due to the second speech recognition
engine 108 being external to the wireless device 100. As recognized by one having
ordinary skill in the art, elements within the communication device 100 have been
omitted from FIG. 1 for clarity purposes only.
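Continuing the illustrative sketch above, this routing decision might be expressed as follows; the heuristic of comparing the entry count against a node capability is an assumption for illustration, not the claimed algorithm.

EMBEDDED, EXTERNAL = "embedded", "external"

def select_engine(indicator: GrammarTypeIndicator,
                  embedded_capability_nodes: int) -> str:
    # An explicit engine pointer wins outright.
    if indicator.engine_pointer is not None:
        return indicator.engine_pointer
    # A grammar class reachable only by URI lives on the network, so
    # route the encoded audio input to the external engine.
    if indicator.grammar_uri is not None:
        return EXTERNAL
    # Inline grammar: treat the entry count as a rough complexity
    # measure and compare it to the embedded engine's capability.
    if len(indicator.entries or []) <= embedded_capability_nodes:
        return EMBEDDED
    return EXTERNAL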
[0022] FIG. 2 illustrates a flow chart representing the steps of the method for
distributed speech recognition. The method begins 130 by receiving a grammar type
indicator having one or more grammar class entries, such as the grammar type
indicators 114 of FIG. 1, wherein the grammar type indicator is associated with an
information request, step 132. In the above example, the grammar type indicator may
represent days of the week and the grammar type indicator entries are the possible
elements of the class defined by the grammar type indicator, such as Monday,
Tuesday, et al. In another embodiment, the grammar type indicator may be a
grammar indicator, such as a universal resource identifier (URI) to a specific grammar
class. Moreover, in another embodiment, the grammar type indicator may be a
pointer to a specific speech recognition engine having the specific grammar class
disposed therein. Next, step 134, a speech input corresponding to one of the grammar
class entries of the grammar type indicator is received in response to the information
request. This speech input, such as encoded audio input 112, corresponds to one of the
entries in the grammar type indicator based upon a user prompt provided to the end
user across the client device 100. In other words, the user is requested to provide a
speech input 110 that is expected to fall within the grammar class.
[0023] Thereupon, step 136, speech recognition is limited to the embedded
speech recognition engine or the external speech recognition engine based on the
grammar type indicator in comparison to a grammar type capability signal. A
grammar type capability signal includes an indication of the recognition complexity level
of the embedded speech recognition engine. The recognition complexity level
corresponds to how many words or phrases the speech recognizer can handle using
the available device resources. The recognition complexity increases as the
recognizable language set increases. Usually, the grammar for the speech recognizer
is represented as a finite state network of nodes and arcs. The
recognition complexity level would be, for example, that the recognition is limited to
such networks of 50 nodes. There exist other implementations and variations of the
recognition complexity level that could be applied and would fall within the scope of
this disclosure. As such, the speech recognition to be performed by either the
embedded speech recognition engine 106 or the external speech recognition engine
108 is thereupon selectively distributed based upon the expected complexity of the
speech input 110 as determined by the grammar type indicator 114 and the grammar
type indicator entries, step 208.
[0024] FIG. 3 illustrates the apparatus for selective distributed speech
recognition of FIG. 1 with a communication network 140 and an information network
142, wherein the information network 142 includes a communication server 144, the
external speech recognition engine 108 and a content backend 146. The
communication network 140 may be a wireless area network, a wireless local area
network, a cellular communication network, or any other suitable network for
providing communication information between the wireless device 100 and the
information network 142 as recognized by one having ordinary skill in the art. The
information network 142 may be an internet, an intranet, a proprietary network, or any
other network allowing for the communication of the content backend 146 with the
communication server 144 and the communication server 144 with the external
speech recognition engine 108. Moreover, the content backend 146 includes any type
of database or executable processor wherein content information 148 may be provided
to the communication server 144, either automatically, upon request from the
communication server, or in response to any other request as provided thereto, as
recognized by one having ordinary skill in the art.
[0025] The wireless device 100 includes the audio receiver 102, the dialog
manager 104, the embedded speech recognition engine 106, a processor 150, a
memory 152, an output device 154, and a communication interface 156 for interfacing
across the communication network 140. The processor 150 may be, but is not limited
to, a single processor, a plurality of processors, a DSP, a microprocessor, an ASIC, a state
machine, or any other implementation capable of processing and executing software
or discrete logic or any suitable combination of hardware, software and/or firmware.
The term processor should not be construed to refer exclusively to hardware capable
of executing software, and may implicitly include DSP hardware, ROM for storing
software, RAM, and any other volatile or non-volatile storage medium. The memory
152 may be, but is not limited to, a single memory, a plurality of memory locations,
shared memory, CD, DVD, ROM, RAM, EEPROM, optical storage, or any other
non-volatile storage capable of storing digital data for use by the processor 150. The
output device 154 may be a speaker for audio output, a display or monitor for video
output, or any other suitable interface for providing an output, as recognized by one
having ordinary skill in the art.
[0026] In one embodiment, the wireless device 100 provides an embedded
speech recognition engine capability signal to the communication server 144 through
the communication network 140. The embedded speech recognition engine capability
signal indicates the level of complexity of encoded audio inputs that the embedded
speech recognition engine can handle, such as a limited number of finite state grammar
(FSG) nodes. The communication server 144, in response to the embedded speech
recognition engine capability signal, provides a plurality of grammar type indicators
to the dialog manager, wherein each grammar type indicator includes an indicator as
to which speech recognition engine is to be utilized for recognizing the corresponding
encoded audio input. In one embodiment, the grammar type indicators are embedded
within a mark-up language page, such that the dialog manager 104 receives the mark-
up language page and thereupon constructs an ordered interface for use by an end
user, such as a multiple entry form, wherein the dialog manager, in response to the
mark-up language, requests a first entry, upon receipt and confirmation, requests a
second entry, and thereupon further entries as indicated by the mark-up page.
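The disclosure does not fix a concrete page syntax. Purely as an assumed illustration, a VoiceXML-flavored mark-up page carrying per-field grammar type indicators, and the ordered walk a dialog manager might make over it, could look like this (the element and attribute names and the URI are invented):

import xml.etree.ElementTree as ET

# Invented VoiceXML-flavored page: each <field> carries a grammar type
# indicator, here shown as a "grammar" plus an "engine" selection hint.
MARKUP_PAGE = """
<form id="order">
  <field name="day" grammar="DAYS_OF_THE_WEEK" engine="embedded"/>
  <field name="fund" grammar="http://example.invalid/funds.grxml"
         engine="external"/>
</form>
"""

def ordered_fields(page: str):
    # Yield (field name, grammar, engine hint) in document order -- the
    # order in which the dialog manager would prompt the end user.
    for f in ET.fromstring(page).iter("field"):
        yield f.get("name"), f.get("grammar"), f.get("engine")

for name, grammar, engine in ordered_fields(MARKUP_PAGE):
    print(f"prompt for {name!r}: grammar={grammar}, route to {engine}")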
[0027] In another embodiment, the dialog manager 104 may be disposed
within the communication server 144 such that it controls the dispatch of the mark-up
content from the content backend 146 to the client device over the network 140 and it
is coupled with a client mark-up browser. For example, a VoiceXML browser,
similar to the dialog manager 104, may be disposed on the communication server 144,
and a GUI browser may be disposed on the wireless device 100 with a submodule for
selection of the recognition engine.
[0028] Referring now to FIG. 4 for further delineation, FIG. 4 illustrates three
exemplary grammar classes with a plurality of grammar class entries. The first
grammar class 170 contains days of the week, having grammar class entries of
Monday 170a, Tuesday 170b, Wednesday 170c, et al. The second grammar class 172
contains names of mutual funds, as may be provided from a financial services
communication server, such as the communication server 144. The second grammar
class entries are names of various mutual funds that a user may select, such as Mutual
Fund 1 172a and Mutual Fund 2 172b. A third grammar class 174 contains numbers as
the grammar class entries, such as a user may enter for purposes of an account
number, a personal identification number, a quantity number, or any other suitable
numerical input, such as one 174a, two 174b and ten 174c.
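For illustration, the three exemplary grammar classes of FIG. 4 might be tabulated and routed as below, assuming the hypothetical 50-node capability from the earlier sketches; the size of the fund list is invented to show why such a class would be routed externally.

# The three exemplary grammar classes of FIG. 4, with an invented large
# fund list to show why such a class exceeds a small embedded engine.
FIG4_CLASSES = {
    "DAYS_OF_THE_WEEK": ["Monday", "Tuesday", "Wednesday", "Thursday",
                         "Friday", "Saturday", "Sunday"],            # 170
    "MUTUAL_FUNDS": ["Mutual Fund %d" % i for i in range(1, 5001)],  # 172
    "NUMBERS": ["one", "two", "ten"],                                # 174
}

for name, entries in FIG4_CLASSES.items():
    route = "embedded" if len(entries) <= 50 else "external"
    print(f"{name}: {len(entries)} entries -> {route} engine")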
[0029] Referring back now to FIG. 3, the dialog manager 104, in response to
the mark-up language page, provides an output request 160 to the output device 154.
The output device 154 thereupon provides an output to an end user, not shown. In
response to the output device 154, the end user provides a speech input 110 to the
audio receiver 102. Similar to the above description with respect to FIG. 1, the audio
receiver 102 encodes the speech input 110 into an encoded audio input 112, which is
provided to the dialog manager 104.
[0030] The wireless device 100 further includes the processor 150 coupled to
the memory 152, wherein the memory 152 may provide executable instructions 162 to
the processor 150. Thereupon, the processor 150 provides application instructions
164 to the dialog manager 104. The application instructions may contain, for
example, instructions to provide connection with the communication server 144 and
provide the terminal capability signal to the communication server 144. In another
embodiment, the processor 150 may be disposed within the dialog manager 104 and
receives the executable instructions 162 directly within the dialog manager 104.
[0031] As discussed above, when the dialog manager 104 receives the
encoded audio input 112, based upon the grammar type indicators 114 of FIG. 1, the dialog manager
104 selects either the embedded speech recognition engine 106 or the external speech
recognition engine 108. When the external speech recognition engine 108 is selected,
the encoded audio input 118 is provided to the interface 156 such that it may be
transmitted to the external speech recognition engine 108 across the communication
network 140. The interface 156 provides for a wireless communication 166 and
thereupon the wireless device 100 may provide a communication 168 to the
information network 142.
[0032] As recognized by one having ordinary skill in the art, the network 140
may be operably coupled directly to the communication server 144 across
communication path 168 and the dialog manager 104 may interface the external
speech recognition engine 108 through the communication server 144, or the dialog manager
104 may be directly coupled through the network interface 156 through the
communication network 140. When the external speech recognition engine 108
receives the encoded audio input 118, the encoded audio input 118 is recognized in
accordance with known speech recognition techniques. The recognized audio input
169 is thereupon provided back to the dialog manager 104. Once again, as recognized
by one having ordinary skill in the art, the recognized audio input 169 may be
provided through the communication server 144 through the communication network
140 and back to the interface 156 within the wireless device 100.
[0033] In another embodiment, the embedded speech recognition engine 106
or the external speech recognition engine 108, based upon which engine is selected by the
dialog manager 104, may provide an N-best list to the dialog manager and a further
level of feedback may be performed, wherein the user is provided the top choices for
recognized audio and thereupon further selects the appropriate recognized input, or the
user can select an action to correct the input if the desired input is not present.
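A minimal sketch of this confirmation loop, assuming the selected engine returns (hypothesis, confidence) pairs; the interaction style and function name are assumptions:

def confirm_from_n_best(n_best: list[tuple[str, float]]) -> str | None:
    # Walk the hypotheses from most to least confident; return the one
    # the user confirms, or None so the dialog manager can re-prompt or
    # offer a correction action.
    for hypothesis, confidence in sorted(n_best, key=lambda h: -h[1]):
        if input(f"Did you say {hypothesis!r}? (y/n) ").lower().startswith("y"):
            return hypothesis
    return None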
[0034] FIG. 5 illustrates the method for distributed speech recognition in
accordance with one embodiment. The method begins 200 by providing a terminal
capability signal to a communication server, wherein the terminal capability signal is
provided across a communication network, step 202. As illustrated with respect to
FIG. 3, the terminal capability signal is provided from the dialog manager 104
through the interface 156 across the communication network 140 to the
communication server 144. In one embodiment, the terminal capability signal is
provided as part of the service session initiation that happens when the wireless
device 100 connects to the communication server 144. The next step, step 204, is
receiving a mark-up page having a grammar type indicator having at least one
grammar class with a plurality of grammar class entries associated therewith.
The mark-up page may be encoded with any recognized mark-up language, such as,
but not limited to, VoiceXML, SALT and XHTML, with the grammar type indicators,
such as grammar type indicators 170, 172 and 174 of FIG. 4.
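As an assumed illustration of this session initiation, the capability exchange might be sketched as follows; the message format and the transport callables are invented, not part of the disclosure:

import json

def start_session(send, receive, max_fsg_nodes: int) -> dict:
    # Send the terminal capability signal as part of session initiation
    # (step 202); the server answers with a mark-up page whose fields
    # carry grammar type indicators (step 204). `send`/`receive` stand
    # in for whatever transport the wireless device actually uses.
    send(json.dumps({"type": "terminal_capability",
                     "max_fsg_nodes": max_fsg_nodes}))
    return json.loads(receive())  # e.g. {"markup_page": "<form>...</form>"}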
[0035] Thereupon an information request is provided to an output device,
wherein the information request seeks a speech input of one of the at least one
grammar class entries, step 206. As discussed with respect to FIG. 3, the information
request 160 is provided to the output device 154 and the speech input 110 is typically
provided from an end user. The next step is receiving a speech input, step 208. The
speech input 110 is typically provided by an end user and is expected to correspond to
at least one of the grammar class entries; for example, with respect to FIG. 4, the
speech input would be expected to be one of the grammar class entries, such as
Monday or Tuesday for the first grammar class 170.
[0036] An encoded audio input is generated from the speech input, step 210.
In one embodiment, the audio receiver 102 receives the speech input 110 and
thereupon generates the encoded audio input 112. The next step, step 212, is selecting
an embedded speech recognition engine or an external speech recognition engine
based on the grammar type indicator. In one embodiment, the dialog manager 104
makes this selection based on the grammar type indicators received within the original
mark-up page. Thus, the encoded audio input is provided to the selected speech
recognition engine.
[0037] The next step, step 216, is receiving a recognized voice input from
either the embedded speech recognition engine or the external speech recognition
engine, based upon which speech recognition engine was chosen and the encoded
audio input provided thereto. The dialog manager 104 receives the recognized voice
input and associates the recognized voice input as an entry for a specific field.
[0038] In one embodiment, the method for selective distributed speech
recognition further includes providing a second information request to the output
device, in response to the second grammar type indicator, step 218. The second
information request seeks a second speech input, such as the speech input 110,
typically provided by an end user. Thereupon, the second speech input is received
within the audio receiver 102, step 220. The audio receiver once again generates a
second encoded audio input, step 222, and provides the encoded audio input to the
dialog manager 104, whereupon the dialog manager once again selects either the
embedded speech recognition engine 106 or the external speech recognition engine
108 based on the grammar type indicator, step 224. The second encoded audio input
is provided to the selected speech recognition engine, step 226. As such, a second
recognized audio input is generated and provided back to the dialog manager 104
from the selected speech recognition engine.
[0039] Thereupon, the method is complete, step 228. As recognized by one
having ordinary skill in the art, the method for selective distributed speech recognition
is continued for each grammar type indicator. For example, if the mark-up page
contains ten fields, the dialog manager would seek ten speech inputs and the audio
receiver 102 would generate ten different encoded audio inputs, and the dialog
manager 104 would thereupon choose, at ten different intervals, for each specific
grammar type indicator, which specific speech recognition engine is to perform the
selective speech recognition.
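A sketch of this per-field loop, reusing the hypothetical select_engine and indicator types from the earlier sketches; the capture-and-encode stub stands in for the output device and audio receiver:

def capture_and_encode(prompt: str) -> bytes:
    # Stand-in for the output device plus audio receiver: issue the
    # information request, capture the speech input, return it encoded.
    print(prompt)
    return b"...encoded audio input..."

def run_form(fields, recognize_embedded, recognize_external,
             capability_nodes: int = 50) -> dict:
    # One pass over the mark-up form: for each field, prompt, encode,
    # route per the field's grammar type indicator, and record the
    # recognized value -- exactly one recognition per field.
    results = {}
    for name, indicator in fields:
        encoded = capture_and_encode(f"Please say the {name}.")
        if select_engine(indicator, capability_nodes) == EMBEDDED:
            results[name] = recognize_embedded(encoded, indicator)
        else:
            results[name] = recognize_external(encoded, indicator)
    return results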
[0040] FIG. 6 illustrates an exemplary method for selective distributed speech
recognition using the embodiment of a financial services network. The method
begins, step 230, when a user accesses a network for financial services, step 232.
Next, the server acknowledges access and provides the dialog manager an application
specific mark-up page and at least one application specific grammar type indicator,
step 234. In response thereto, the dialog manager queries the user for a speech input
based on the first grammar class, step 236.
[0041] The user provides the audio input to the audio receiver, step 238. The
wireless device thereupon distributes the first audio input to the first speech
recognition engine based on the first grammar type indicator, step 240. In this
embodiment, the first grammar type indicator contains an indication to have the
encoded audio input recognized by the embedded speech recognition engine based
upon the complexity of the grammar class entries.
[0042] Next, the dialog manager queries the user for a second speech input
based on a second grammar class, step 242. The user provides the second audio input
to the audio receiver, step 244. The wireless device distributes the second audio input
to the second speech recognition engine based on the second grammar type indicator,
step 246, wherein the second grammar type indicator indicates a level of complexity
beyond the speech recognition capabilities of the embedded speech recognition
engine. Once again, the dialog manager queries the user for a third speech input, this
time based on a third grammar class, step 248. The user provides the third audio input
to the audio receiver, step 250. The wireless device distributes the third audio input to
the first speech recognition engine based on the third grammar type indicator, wherein
the third grammar type indicator, similar to the first grammar type indicator, indicates
recognition capabilities within the ability of the embedded speech recognition engine
106. Thereupon, the method is complete, step 254, and all of the entries for the
application specific mark-up page have been completed.
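Instantiating the earlier sketches with this financial-services dialog, the three queries might route as follows (the field names and the fund grammar URI are invented):

fields = [
    ("trade day", GrammarTypeIndicator("DAYS_OF_THE_WEEK",
                                       entries=days_of_week.entries)),
    ("mutual fund", GrammarTypeIndicator(
        "MUTUAL_FUNDS", grammar_uri="http://example.invalid/funds.grxml")),
    ("quantity", GrammarTypeIndicator("NUMBERS",
                                      entries=["one", "two", "ten"])),
]
for name, indicator in fields:
    print(name, "->", select_engine(indicator, embedded_capability_nodes=50))
# trade day -> embedded, mutual fund -> external, quantity -> embedded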
[0043] FIG. 7 illustrates one example of another embodiment of a method for
selective distributed speech recognition. The method begins, step 260, by receiving
an embedded speech recognition engine capability signal, step 262. As discussed
above, the embedded speech recognition engine capability signal indicates the level of
complexity at which the embedded speech recognition engine within the wireless
device may properly and effectively recognize an encoded audio input. The next
step, step 264, includes retrieving a mark-up page having at least one entry field,
wherein at least one of the entry fields includes at least one of a plurality of grammar
classes associated therewith. The at least one entry field includes fields for an
interactive mark-up page wherein a user typically provides an input to the entry field.
[0044] The next step is comparing the at least one of the plurality of grammar
classes with the embedded speech recognition engine capability signal, step 266.
Thereupon, for each entry field having at least one of the plurality of grammar classes
associated therewith, assigning either the embedded speech recognition engine or an
external speech recognition engine to conduct the speech recognition, based upon the
embedded speech recognition engine capability signal, step 268.
[0045] Thereupon, for each entry field having at least one of the plurality of
grammar classes associated therewith, the method includes inserting a grammar type
indicator within the mark-up page, wherein the grammar type indicator includes either
a grammar class, a grammar indicator, a speech recognition pointer, or any other
suitable notation capable of directing a dialog manager or multi-modal browser to a
particular speech recognition engine, step 270.
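On the server side, the compare-and-insert steps just described (steps 266-270) might be sketched as a single annotation pass over the mark-up page; the attribute names and the grammar-size table are assumptions for illustration:

import xml.etree.ElementTree as ET

def annotate_markup(page: str, capability_nodes: int,
                    grammar_sizes: dict[str, int]) -> str:
    # For every entry field with a grammar class (step 266), compare its
    # size to the embedded engine capability signal (step 268) and insert
    # an engine-selection hint into the mark-up page (step 270).
    root = ET.fromstring(page)
    for field in root.iter("field"):
        size = grammar_sizes.get(field.get("grammar"), float("inf"))
        field.set("engine",
                  "embedded" if size <= capability_nodes else "external")
    return ET.tostring(root, encoding="unicode")

page = '<form><field name="day" grammar="DAYS_OF_THE_WEEK"/></form>'
print(annotate_markup(page, 50, {"DAYS_OF_THE_WEEK": 9}))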
[0046] Thereupon, step 272, the mark-up page is provided to a wireless
device, such as the wireless device 100 of FIG. 1. Furthermore, the method includes
receiving an encoded audio input for each of the entry fields having a grammar type
indicator that indicates the selection of the external speech recognition engine, step
274.
[0047] The method further includes providing the encoded audio input to the
external speech recognition engine, step 276. Thereupon, the encoded audio input is
recognized, step 278, and a recognized audio input is provided to the wireless device,
step 280. Thereupon, the method for selective distribution from the perspective of a
communication server, such as communication server 144 of FIG. 3, is complete.
[0048] In another embodiment, the grammar type indicator, such as 170, is
embedded within the mark-up page provided to the wireless device 100, such that the
wireless device 100 may selectively choose which speech recognition engine is
enabled based on an embedded speech recognition engine capability signal.
Furthermore, one embodiment allows for a user to override the selected speech
recognition through the active deselection of the selected speech recognition engine.
For example, the embedded speech recognition engine 106 may be unreliable due to excess
ambient noise; therefore, even though the embedded speech recognition engine 106
may be selected, the external speech recognition engine 108 may be utilized. In another
embodiment, the wireless device 100 may provide a zero capability signal, wherein the
terminal capability signal indicates that the embedded speech recognition
engine 106 has zero recognition capability, in essence providing for all speech
recognition to be performed by the external speech recognition engine 108.
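The override and zero-capability behaviors of this paragraph might reduce to a small decision function, sketched here with invented names:

def effective_engine(selected: str, zero_capability: bool,
                     user_override: str | None) -> str:
    # A zero capability signal forces all recognition to the external
    # engine; otherwise an explicit user deselection (e.g. under heavy
    # ambient noise) overrides the page's selection.
    if zero_capability:
        return "external"
    return user_override or selected

assert effective_engine("embedded", True, None) == "external"
assert effective_engine("embedded", False, "external") == "external"
assert effective_engine("embedded", False, None) == "embedded"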
[0049] It should be understood that there exist implementations of other
variations and modifications of the invention and its various aspects, as may be
readily apparent to those of ordinary skill in the art, and that the invention is not
limited by the specific embodiments described herein. For example, a plurality of
external speech recognition engines may be utilized across a communication network
140 such that further levels of selective distributed speech recognition may be
performed on the communication server side in that a server-side speech recognition
engine may be more aptly suited for a particular input, such as numbers, and there still
exists the original determination of whether the encoded audio input may be
recognized with the embedded speech recognition engine 106 or is outside of the
embedded speech recognition engine 106 capabilities. It is therefore contemplated
and covered by the present invention any and all modifications, variations, or
equivalents that fall within the spirit and scope of the basic underlying principles
disclosed and claimed herein.
CLAIMS
What is claimed is:
1. A method for selective distributed speech recognition comprising:
receiving a grammar type indicator associated with an information request;
receiving a speech input in response to the information request; and
limiting speech recognition to at least one of: a first speech recognition engine
and at least one second speech recognition engine, based on the
grammar type indicator in comparison to a grammar type capability of
the embedded speech recognition engine.
2. The method of claim 1 further comprising:
providing the speech input to the selected at least one of the following: the
first speech recognition engine and the at least one second speech
recognition engine.
3. The method of claim 2 further comprising:
generating an encoded audio input from the speech input; and
associating the encoded audio input as a response to the information request.
4. The method of claim 1, wherein the grammar type indicator includes at
least one of the following: a grammar class, a grammar indicator and a speech
recognition pointer.
5. The method of claim 1 further comprising:
prior to receiving the grammar type indicator, accessing a server and providing
a terminal capability signal to the server; and
receiving the grammar type indicator from the server in response to the
terminal capability signal.
6. The method of claim 5 further comprising:
receiving a mark-up page including the grammar type indicator and an
ordering scheme for the information request.
7. The method of claim 1 wherein the first speech recognition engine is
an embedded speech recognition engine and the at least one second speech
recognition engine is an at least one external speech recognition engine, wherein the
embedded speech recognition engine is disposed within a wireless device and the at
least one external speech recognition engine is disposed on a communication server.
8. A wireless device comprising:
a dialog manager capable of receiving a grammar type indicator, the dialog
manager being operably coupleable to at least one external speech
recognition engine;
an audio receiver operably coupled to the dialog manager such that the audio
receiver receives a speech input and provides an encoded audio input
to the dialog manager; and
an embedded speech recognition engine operably coupled to the dialog
manager such that the dialog manager provides the encoded audio
input to at least one of the following: the embedded speech recognition
engine and the at least one external speech recognition engine, based
on the grammar type indicator in response to a grammar type capability of
the embedded speech recognition engine.
9. The wireless device of claim 8 wherein:
the grammar type indicator includes at least one of the following: a grammar
class, a grammar indicator that indicates the grammar class and a
speech recognition pointer that points to at least one of the following:
the embedded speech recognition and the at least one external speech
recognition which contain the grammar class; and
the grammar class includes a plurality of grammar class entries.
10. The wireless device of claim 8 wherein the dialog manager is operably
coupleable to the at least one external speech recognition engine through a
communication network.
11. The wireless device of claim 9 wherein the grammar type indicator is
received from a communication server through a communication network.
12. The wireless device of claim 11 further comprising:
a communication interface operably coupled to the dialog manager such that
the dialog manager may receive the grammar type indicator from the
communication server.
13. The wireless device of claim 9 further comprising:
an output device operably coupled to the dialog manager such that the output
device may output a data request in response to the grammar type
indicator, such that the speech input is expected to correspond to at
least one of the grammar type entries.

Documents:


Patent Number 218679
Indian Patent Application Number 01004/KOLNP/2005
PG Journal Number 15/2008
Publication Date 11-Apr-2008
Grant Date 09-Apr-2008
Date of Filing 27-May-2005
Name of Patentee MOTOROLA, INC.
Applicant Address 1303 EAST ALGONQUIN ROAD, SCHAUMBURG, IL 60196, UNITED STATES OF AMERICA.
Inventors:
# Inventor's Name Inventor's Address
1 ANASTASAKOS, TASOS 1026 MONICA LANE,SAN JOSE, CA 95128, UNITED STATES OF AMERICA.
2 BALASURIYA, SENAKA 1405 CRANE STREET, ARLINGTON HEIGHTS, IL 60004, UNITED STATES OF AMERICA.
3 VAN WIE, MICHAEL 24 PORTSMOUTH TERRACE #3, ROCHESTER, NEW YORK 14607, UNITED STATES OF AMERICA.
PCT International Classification Number G01L
PCT International Application Number PCT/US2003/037898
PCT International Filing date 2003-11-24
PCT Conventions:
# PCT Application Number Date of Convention Priority Country
1 10/334,030 2002-12-30 U.S.A.