Title of Invention	A VOICE RECOGNITION APPARATUS
Abstract	The present inventon relates to a voice recognition (VR) system is disclosed that utilizes a combination of speaker independent (SI) (230 and 232) and speaker dependent (SD) (234) acoustic models. At least one SI acoustic model (230 and 232) is used in combination with at least one SD acoustic model (234) to provide a level of speech recognition performance that at least equals that of a purely SI acoustic model. The disclosed hybrid SI/SD VR system continually uses unsupervised training to update the acoustic templates in the one ore more SD acoustic models (234). The hybrid VR system then uses the updated SD acoustic models (234) in combination with the at least one SI acoustic model (230 and 232) to provide improved VR performance during VR testing.

Title of Invention

A VOICE RECOGNITION APPARATUS

Abstract

The present inventon relates to a voice recognition (VR) system is disclosed that utilizes a combination of speaker independent (SI) (230 and 232) and speaker dependent (SD) (234) acoustic models. At least one SI acoustic model (230 and 232) is used in combination with at least one SD acoustic model (234) to provide a level of speech recognition performance that at least equals that of a purely SI acoustic model. The disclosed hybrid SI/SD VR system continually uses unsupervised training to update the acoustic templates in the one ore more SD acoustic models (234). The hybrid VR system then uses the updated SD acoustic models (234) in combination with the at least one SI acoustic model (230 and 232) to provide improved VR performance during VR testing.

Full Text	VOICE RECOGNITION SYSTEM USING IMPLICIT SPEAKER ADAPTATI ON BACKGROUND Field [1001] The present invention relates to speech signal processing. More particularly, the present invention relates to a navel voice recognition method and apparatus for achieving improved performance through unsupervised training. Background [1002] Voice recognition represents one of tb« mast important techniques to endow a machine with simulated intelligence to'recognize user voiced commands and to facilitate human interface with the. machine. Systems that employ techniques ic recover a linguistic: message from an acoustic speech signal are called voice recognition (VR) systems. FIG, 1 shows a basic VR system having a preemphasis filter 102, an acoustic feature extraction (AFE) unit 104, and a pattern marching engine 110. Ths AFE unit 104 converts H series of digital voice samples into a set of measurement values (for example, extracted frequency components) called an acoustic feature vector The pattern matching engine 110 matches a series of accustfc feature vectors with the templates contained in a VR acoustic mode! 112. VR pattern matching engines generally employ either Dynamic Time Warping (DTW) or Hidden Markov Mode! (HMVI) techniques. Both DTW and HMM are well known in the art, and are described in detail in Rabiner, L R. and Juang, B. H-, FUNDAMENTALS OF SPEECH RECOGNITION, Prentice Hall, 1993. When a series of acoustic features matches a template in the acoustic model 112, the identified template is used to generate a desired format of output, such as an identified sequence of linguistic words corresponding to input* speech. [1003] As: noted above, the acoustic modei 112 is generally either a HMM mode! or a DTW model. A DTW-acoustic model may be thought of as a date bast? of templates associated with each of trie words that need to be recognizee! In general; a DTW template consists of a sequence of feature vectors that has been averaged over many examples of the associated word. DTW pattern matching generally involves locating a stored template that has minimal distance to the input feature vector sequence representing input speech, A template used in an HMM based acoustic mode! contains a detailed statistical description of the associated speech utterance. In general, a HMW template stores a sequence of mean vectors, variance vectors md a set of transition probabilities. These parameters are used to describe the statistics of a speech unit and- are estimated from many examples of the speech unit HMM pattern matching generally involves generating a probability for each template in the model based on the series of input feature vectors associated with the input speech. The template having the highest probability is selected as the most likely input utterance. [10(14] Training" refers to the process of collecting speech samples of a particular speech segment or syllable from one or more speakers in order to gerejrate "templates in the "acoustic model 112. Each template in the acoustic model is associated with a particular word 01 speech segmant called an utterance class. There may be multiple templates In the acoustic model associated uith the same utterance class. Testing* refers to the procedure for matching he templates in the acoustic model to a sequence of feature vectors extracted from input speech. The performance of a given system depends largely upon the degree of match between the input speech of the end-user and the contents, of the database, and hence on the match between the reference templates created through training and the speech samples used for VR testing. [1005] The two common types of training are supervised training and unsupervised training- In supervised training, the utterance class associated with each set of training feature vectors is known a priori. The speaker providing the input speech is often provided with a script of words or speech segnents corresponding to the predetermined utterance classes The feature vectors resulting from the reading of the script may then be incorporated into the acoustic model templates associated with the correct utterance classes. [1006] In unsupervised tralnlngi the utterance class associated with a set of training feature vectors is not known a priori. The utterance class must be correctly Identified before a set of training feature vectors can be incorporated into the correct acoustic model template. In unsupervised training, a mistake in icentlfying the utterance class for a set of training feature vectors can iead to a modifier;ion in the wrong acojstic model template. Such a mistake generally degrades, rather than improves, speech recognition performance. In order to avoid such a mistake, any modification of an acoustic model based on unsupervised training must generally be dene ver/ conservatively, A set of training feature vectors is incorporated into the acoustic model only if there is relatively high confidence that the utterance class has been correctly identified. Such necessary conservatism makes building an SD acoustic model through unsupervised training a very slow process. Until tine SD acoustic model is built in :his way, VR performance wll probably be unacceptable to most users. [1007] Optimally, the end-user provides speech acoustic feature vectors during both training and testing, so that the acoustic model 112 will match strongly with the speech of the end-user. An individualized acoustic model that., is ;ai!ored to a single speaker is also called a speaker dependent (SD) acoustte model. Generating an SD acoustic model generally requires the end-user to provide a large amount of supervised training samples. First, the user must provide training samples for a large variety of utterance classes. Also, in order to achieve the best performance, the end-user must provide multiple templates representing a variety of possible acoustic environments for each utterance class. Because most users are unable or unw King to provide the input speech necessary to generate an 3D acoustic modal, many existing VR systems instead use generalized acoustic models that are trained using the speech of msny "representative" speakers. Such acoustic models are referred to as speaker independent (SI) acoustic models, and are designed to have the best performance over a broad range of users. SI acoustic models, however, may no1.: be optimized to any single user. A VR system that uses an SI acoustic rncde! will not perform as well for a specific user as a VR system that uses an $C acouattc model tailored to that user. For some users, such as those having a strong foreign accents, the performance of a VR system using an SI acoustic model can be so poor that they cannot effectively use VR services at all. [1Q08] Optimally, an SO acoustic model would be generated for each individual user. As discussed above, building SD acoustic models using supervisee! training is impractical. But using unsupervised training to generate a SD acoustic model can take a long time, during which VR performance based on a partial SD acoustic model may be very poor- There is a need in the art for a VR system that performs reasonably well before and during the generation of an SD acoustic model using unsupervised training. SUMMARY [1009] The methods and apparatus disclosed herein are directed to a novel and improved voice recognition (VR) system that utilizes a combination of speaker independent (SI) and speaker dependent (SD) acoustic models. At least one SI acoustic mode,: is used in combination with at ieast one SD acoustic -nodel to provide a level of speech recognition performance that at least egja;s that of a purely SI acoustic model. The disclosed hybrid SI/SO VR system continually uses unsupervised training to update the acoustic templates in the one" or1'more SD acoustic models. The hybrid VR system then uses the updated 3D acoustic models, alone or in combination with the at least one SI acoustic model, to provide improved VR performance during VR testing. [1D10] The word "exemplar/1 is used hsrern to mean "serving as an example, instance, or illustration.n Any embodiment described as an Exemplary embodiment" is not necessarily to be construed as being preferred or advantageous over another embodiment. BRIEF DESCRIPTION OF THE DRAWINGS [1011] The features, objects, ana advantages of the presently disclosed method and apparatus wit* become more apparent from the detailed description set forth below when taken in conjunction witn the drawings in which like reference characters identify correspondingly throughout and wherein: [1012] FIG. 1 shows a basic voice recognition system; [1013] FIG. 2 shows a voice recognition system according to an exemplary embodiment; [1014] FfG, 3 shows a method for performing unsupervised training. [1015] FIG. 4 shows an exemplary approach to generating a combined matching score used in unsupeivised training, [1016] FIG- 5 is a flowchart showing a method for performing voice recognition (testing) using both speaker independent (SI) and speaker dependent (SD) matching scores; [1017] FiG* 6 shows an approach to generating a combined matching score from both speaker independent (SI) and speaker dependent (SD) matching scores; and DETAILED DESCRIPTION [1018] FIG. 2 shows an exemplary embodiment of a hybrid voice recognition (VR) system as might be implemented within a wireless remote station 202. In an exemplary embodiment, the remote station 202 communicates through a wireless channel (not shown) with a wireless communication network (not shewn), -or example, the remote station 202 may be a wireless phone communicating with a wireless phone system. One skilled in the art will recognize that the techniques described herein may be equally applied to a VR system that is fixed (non-portable) or does not involve a wireless channel. [1019] In the embodiment shown, voice signals from a user are converted intc electrical signals in a microphone (MIC) 210 and converted into digital speech samples in an analog-to-digital converter (ADC) 212. The digital sample stream is then filtered using a preemphasis (PE) filter 214, for example a finite impulse response (FIR) filter that attenuates low-frequency signal components-[1020] The filtered samples are then analyzed in an acoustic feature extraction (AFE) unit 216. The AFE unit 216 converts digital voice samples into accustic feature vectors. In an exemplary embodiment, the AFE unit 216 perforins a Fourier Transform on a segment of consecutive digital samples to generate a vector of signal strengths corresponding to different frequency bins. In an exemplary embodiment, the frequency bins have varying bandwidihs in accordance with a bark scale, In a bark scale, the bandwidth of each frequency bin bears a relation to the center frequency of the bin, such that higher-frequency bins have wider frequency bands than lower-frequency bins. The bark scale is described in Rabiner, L. R. and Juang, B. H., FUNDAMENTALS OF SPEECH RECOGNITION, Prentice Hall, 1993 and is well known in the art. [1021] In an exemplary embodiment, each acoustic feature vector is extracted from a series of speech samples collected over a fixed time interval. In an exemplary embodiment, these time intervals overlap. For example, acoustic features may be obtained from 20-mlllisecond intervals of speech data beginning evsry ten milliseconds, such that each two consecutive intervals snare a 10-millisecond segment One skilled in the art would recognize that, the tine intervals might instead be non-overlapping or have non-fixed duration without departing from the scope of the embodiments described herein. [1022] The acoustic feature vectors generated by the AFE unit 218 are provided to a VR engine 220, which performs pattern matching to characterize the acoustic feature vector based on the contents of one or more acoustic models 230, 232, and 234. [1023] In the exemplary-embodiment shown in FIG. 2, three acoustic models are shown: a speaker-independent (SI) Hidden Markov Model (HMM) model • 230, a speaker-independent Dynamic Time Warping (DTW) model 232, and a speaker-dependent (SO) acoustic model 234. One skilled in the art will recognize that different combinations of SI acoustic models may be used in alternate embodiments. For example, a remote station 202 might include just the SiHMM acoustic model 230 and the SD acoustic model 234 and omit the SIDTW acoustic model 232. Alternatively, a remote station 202 might include a single SiHMM acoustic model 230, a SD acoustic model 234 and two different SIDTW acoustic models 232. In addition, one skilled In the ail will recognize that the SD acoustic model 234 may be of the HMM type or the DTW type or a combination of the two. In an exemplary embodiment, the SD acoustic model 234 is a DTW acoustic model. ["024] As described above, the VR engine 220 performs pattern matching to ■dstermine the degree of matching between the acoustic feature vectors and the contents of one or more acoustic models 230, 232t and 234. In an exemplary embodiment, the VR engine 220 generates matching scores based on matching acoustic feature vectors with the different acoustic templates in each of the acoustic models 230,232, and 234. For example, the VR engine 220 generates HMM matching scores based on matching a set of acoustic feature matching :he acoustic feature vectors with multiple DTW templates in the SIDTW acoustic model 232. The VR engine 220 generates matching scores b&sed on matching the acoustic feature vectors with the templates in the SD acoustic model 234 [1325] As described above, each template in an acoustic model is associated with an utterance class. In an exemplary embodiment, the VR er-gine 220 combines scores for templates associated with the same utterance c:ass to create a combined matching score to be used in unsupervised training. For example, the VR engine 220 combines S1HMM and SIDTW scores obtained from correlating an input set of acoustic feature vectors to generate a combined SI score. Based on that combined matching score, the VR engine 220 determines whether to store the input set of acoustic feature vectors as a 3D template in the SD acoustic model ?34 In an exemplary embodiment, unsupervised training to update the SD acoustic model 234 is performed using exclusively SI matching scores. This prevents additive errors that might otherwise result from using an evolving SD acoustic model 234 for unsupervised training of itself. An exemplary method of performing this unsupervised training is described In greater detail below. [1926] In addition to unsupervised training, the VR engine 220 uses the vsrious acoustic models (230, 232, 234) during testing. In an exemplary embodiment, the VR engine 220 retrieves matching scores from the acoustic models (230, 232, 234) and generates combined matching scores for each utterance class. The combined matching scores are used to select the utierance class that best matches the input speech. The VR engine 220 groups consecutive utterance classes together as necessary to recognize whole words or phrases. The VR engine 220 then provides information about the recognized word o* phrase to a control processor 222, which uses the information to determine the appropriate response to the speech information or command. For example, in response to the recognized word or phrase, the control processor 222 may provide feedback to the user through a display or other user interface. In another example, the control processor 222 may send a message through a wireless modern 218 and an antenna 224 to a wireless network (not shown), initiating a mobile phone call to a destination phone number associated with the person whose name was uttered and recognized, [1027] The wireless modem 218 may transmit signals through any of a variety of wireless channel types including CDMA, TDMA, or FDMA. In adcition, the wireless modem 218 may be replaced with other types of communications interfaces that communicate over a non-wireless channel without departing from the scope of the described embodiments. For example, the remote station 202 may transmit signaling information through any of a variety of types of communications channel including land-line modems, T1/E1, ISDN, DSL, ethernet, or even traces on a printed circuit board (PCB). [1028] FIG. 3 is a flowchart showing an exemplary method for performing unsupervised training. At step 302, analog speech data is sampled in an analog-to-digital converter (ADC) (212 in FIG. 2). The digital sample stream is then filtered at step 304 using a preemphasis (PE) filter (214 in FIG. 2). At step 305, input acoustic feature vectors are extracted/from the filtered samples in an acoustic feature extraction (AF'E) unit (216 in FIG. 2), The VR engine (220 in FIG. 2) receives the input acoustic feature vectors from the AFE unit 216 and performs pattern matching o? the input acoustic feature vectors against the contents; cf the SI acoustic models (230 and 232 in FIG. 2). At step 308, the VR engine 220 generates matching scores from ths results of the pattern matching. The VR engine 22fl generates SIHMM matching scores by matching the input acoustic feature vectors with the SIHMM acoustic model 230, and generates SIDTW matching scores by matching the input acoustic feature vectors wiih the SIDTW acoustic model 232. Each acoustic template in the SIHMM and SIDTW acoustic models (230 and 232) is associated with a particular utterance class. A: step 310, SIHMM and SIDTW scores are combined to form combined matching scores. [1029] FIG. 4 shows the generation of combined matching scores for use in unsupervised training, in the exemplary embodiment shown, the speaker independent combined matching score SCOMBRI for a particular utterance class is a weighted sum according to EQN. 1 as shown, where: SIHMMr is tne SIHMM matching score for the target utterance class; SIHMMNT is the next best matching score for a template in the SIHMM acoustic model that is associated with a non-target utterance class (an utterance ctess other than the target utterance class); SIHMMe is the SIHMM matching score For the "garbage" utterance class; SlDTWr is the SIDTW matching score for the target utterance class; SIDTWNT is the next best matching score for a template in the SIDTW acoustic model that is associated with a non-target utterance class; and SIDTWo is the SIDTW matching score for tie "garbage" utterance ciass. [1030] The various individual matching scores SIHMMn and SIDTWn may be viewed as representing a distance value between a series of input acoustic feature vectors and a template In the acoustic model. The greater the distance between the input acoustic feature vectors and a template, the greater the matching score. A close match between a template and the input acoustic feature vectors yields a very low matching score* If comparing a series of input acoustic feature vectors to two templates associated with [1032] Before the VR system can confidently recognise an utterance class a:> the "correct" one, the input acoustic feature vectors should have a higher .degree of matching with templates associated with that utterance class than with garbage templates or templates associated other utterance classes. Combined matching scores generated from a variety of acoustic models can rrore confidently discriminate between utterance classes than matching scores based on only one acoustic model. In an exemplary embodiment the VR system uses such combination matching scores to determine whether to replace a template in the SD acoustic model (234 in FIG. 2) with one derived from a new set of input acoustic feature vectors. [1033] The weighting factors {Wi .. - W$) are selected to provide the best training performance over all acoustic environments. In an exemplary embodiment, the weighting factors (Wi . . . W6) are constant for all utterance classes. In other words, the Wn used to create the combined matching score for a first target utterance class is the same as the Wn value used to create the combined matching score for another target utterance class. In an alternate embodiment, the weighting factors vary based on the target utterance class. Other ways of combining shown in FIG. 4 will be obvious to one skilled in the ait, and are to be viewed as within the scope of the embodiments described herein. For example, more than six or less than six weighted inputs may also bo used. Another obvious variation would he to generate a combined matching score based on one type of acoustic model. For example, a combined matching score could be-generated based on SIHMMT, SIHMMMT, and SIHMMQ. •Or, a combined matching score could be generated based on SIDTWT, SIDTWNT, and SIDTW3. [1034] In an exemplary embodiment, Wi and W4 are negative numbers, and a greater (or less negative) value of Scows indicates a greater degree of matching (smaller distance) between a target utterance class and a series of input acoustic feature vectors. One of skill in the art will appreciate that the signs a: the weighting factors may easily be rearranged such that a greater d-sgree of matching corresponds to a lesser value without departing from the scope of Hie disclosed embodiments. [1035] Turning back to FIG- 3, at step 310, combined matching scores are generated for utterance classes associated with templates in the HMM and DTW acoustic models (230 and 232). In an exemplary embodiment, combined matching scores are generated only for utterance classes associated with the Dsst n SIHMM matching scores and for utterance classes associated with the bast m SIDTW matching scores. This limit may be desirable to conserve computing resources, even though a much larger amount of computing power is consumed while generating the individual matching scores. For example, if n=m=3I combined matching scores are generated for the utterance classes associated with the top three SIHMM and utterance classes associated with the to 3 three siuiw matching scores. Depending on whether the utterance classes associated with the top three SiHMM matching scores are the same as th* utterance classes associated with the top three SIDTW matching scores, this approach will produce three to six different combined matching scores. [1-336] At step 312, the remote station 202 compares the combined matching scores with the combined matching scores stored with corresponding templates (associated with the same utterance class) in the SD acoustic model. If the new series of input acoustic feature vectors has a greater degree of matching than that of an older template stored in the SD model for the same utterance class, then a new SD template is generated from the new series of input acoustic feature vectors. In an embodiment wherein a SD acoustic model is a DTW acoustic model, the series of input acoustic vectors itself constitutes the new SD template. Ths older template is then replaced with the new template, and the combined matching score associated with the new template is stored in the SD acoustic model to be used in future comparisons. \|1037] ' In an alternate embodiment, unsupervised training is used to update ore or mere templates in a speaker dependent hidden markov model (SDHMM) acoustic model. This SDHMM acoustic model could be used either in place of an SDDTW model or in addition to an SDDTW acoustic model within the SD acoustic mode! 234. [1038] In an exemplary embodiment, the comparison at step 312 also includes comparing the combined matching score of a prospective new SD te-nplate with a constant training threshold. Even if there has not yet been any te-npfate stored in a SD acoustic model for a particular utterance class, a new template will not be stored In the SD acoustic model unless it has a combined matching score that Is better (indicative of a greater degree of matching) than th«3 training threshold value. [1339] In an alternate embodiment, before any templates in the SD acoustic model have been replaced, the SD acoustic model is populated by default with templates from the SI acoustic model. Such an initialization provides an alternate approach to ensuring that VR performance using the SD acoustic model will start out at least as good as VR performance using just the S! acoustic model. As more and more of the templates irrthe SD acoustic model are jpdatscl, the VR performance using the. SD acoustic moaei win surpass VK performance using just the SI acoustic model. [1040] in an alternate embodiment, the VR system allows a user to perform supervised raining. The user must put the VR. system into * supervised training mode before performing such supervised training. During supervised training, the VR system has a priori knowledge of the correct utterance class. If the combined matching score for ths input speech is better than the combined matching score for the 3D template previously stored for that utterance class, the:i the ir.put speech is used io form a replacement SD template. In an alternate embodiment, the VR system allows the user to force replacement of existing SD templates during supervised training. [1041] The 3D acoustic model may be designed with room for multiple (two or more) templates for a single utterance class. \r an exemplary embodiment, two templates are stored in the SD acoustic mode! for each utterance class. Th«* comparison at stap 312 therefore entails comparing the matching score obtained with a new tempiats with the matching scores obtained for both templates in the SD acoustic mode! for the same utterance daiss. If the new lenpiate has a better matching score than sither older template in the SD acoustic model, then at step 314 the SD acoustic model template having xhe wcrst matching scoie is replaced with the new template. If the matching score of the ne.w template is no belter than either oider template, then step 314 is skpped. Additionally, at step 312, the matching score obtained with the new template is compared against a matching score threshold. So, until new templates naving a retching score that is belter than the threshold are stored in th* SD sr:oustic (t\cdel the ns-w templates are compared against this threshold value before they will be used to overwrite the prior contents of the SD acoustic model. Obvious variations, such as storing the SD acoustic model templates in sorted order according to combined matching score and comparing new matching scores only with the lowest, ars anticipated and are TO be considered within the scope of the embodiments disclosed herein. Obvious variations on cumbers of templates arored in tho acoustic mode;! for each utterance class are &lso anticipated. For example, the SD acoustic mode! may contain more than two templates for sach utterance class, or may contain different numbers of templates for different utterance classes. [1042] FIG- 5 is a flowchart showing an exemplary method for performing VR testing using a combination of SI and 3D acoustic models. Steps 302, 304, 30Ci, and 308 are the same as described for FIG* 3. The exemplary method diverge; Ircm the method shown in FIG. 3 at step 510. At step 51 G> the VR engine 220 generates SD matching scores basad on comparing the input acoustic feature vectors with templates in the SD acoustic modei. In an exsmpiary embodiment, SD matching scores are generated only for utterance classes associated with the best n SIHMM matching scores and the best rn SIDTW matching scores. In an exemplary embodiment, n-m-3. Depending on the degree of overlap between the two sets of utterance classes, this will result in generation of SD matching scores for three to six utterance classes. As discussed above, the SD acoustic model may contain multiple templates for a single utterance class. At step 512, the VR engine 220 generates hybrid conbined matching scares for use in VR testing. In an exemplary embodiment, thtsse hybrid combined matching scores are based on both individual SI and ind vidua! 3D matching scores. At step 514, the word or utterance having the beiit combined matching score is selected and compared against a testing threshold. An utterance is on'y deemed recognized if its combined matching sec re exceeds this resting threshold. In an exemplary embodiment, the weights [WM ... Wt5] used to generate combined scores for training (as shown in FIG. 4) are equal ;o the weights [W^ . . . V\fe] used to generate combined scores for testing (as shown in FIG, 6), but ihe training threshold is not equal to the testing threshold. [1043] FIG- 6 shows the generation of hybrid combined matching scores performed at step 5^2. The exemplary embodiment shown operates identically to the combiner shown in FIG, 4, except that the weighting factor W.* is applied to OTWT instead of SiDTWr ar»c the weighting factor Ws is applied to DTWNT instead of SIDTVVNT. DTWT (the dynamic time warping matching score for ine target utterance class) is selected from the best of the SIDTW and SDDTW scores associated with the target utterance class. Similarly, DTWNT (the dynamic tins warping matching score for the remaining non-target utterance classes) is selected from the best of the SiDTW and SQD7W scores associated witn non-target utterance classes. [1044] The SI/SD hybrid score SCOMB_H for a particular utterance class is a weighted sum according to EQN. 2 as shown, where SIHMM-r, SIHMMN-r, SiHMMe, and SIDTW© are the same as in EQN. 1. Specifically, in EQN. 2: SIHMMT is the SIHMM matching score for the target utterance class; SIHMMNT is the next best matching score for a template in the SIHMM acoustic model that is associated with a non-target utterance class (an utterance class other than the target utterance class); SIHMMG is the SIHMM matching score for the "garbage" utterance class; DTWT is the best DTW matching score for SI and SD templates corresponding to the target utterance class; DTWNT is the best DTW matching score for SI and SD templates corresponding to non-target utterance classes; and SIDTWG is the SI DTW matching score for the "garbage" utterance class. Thus, the Sl/SD hybrid score SCOMB__W is a combination of individual S! and SD matching scores. The resulting combination matching score does not rely entirely :>n either SI or 3D acoustic models If the matching score S!DTWT is better than any SDDTWT score, then the Sl/SD hybrid score is computed from ;he better SIDTWr score. Similarly, if the matching score SDDTWT is better han any SIDTWT score, then the SI/SD hybrid score is computed from the oetter SDDTWT score. As a result, if the templates in the SD acoustic model yield poor matching scores, the VR system may stilt recognize the input speech oasec cr\ ths SI portions of the SI/SD hybrid scores. Such poor SD matching scores might have a variety of causes including differences between acoustic environments during training and testing or perhaps poor quality input used for training [1045] In an alternate embodiment, the SI scores are weighted less heavily than the SD scores, or may even be ignored entirely. For example. DTWy is selected from the best of the SDDTW scores associated with the target • utterance class, ignoring the SIDTW scores for the target utterance class. Also, DTWNT may be selected frcm the best of either the SIDTW or SDDTW scores associated wit!) non-target utterance classes, instead of using both sets of scores. [1046] Though the exemplary embodiment is described using only SDDTW acoustic models for speaker dependent modeling, the hybrid approach described herein is equally applicable to a VK system using SUMMM acoustic models or even a combination of SDDTW and SDHMM acoustic models. For e>ample, by modifying the approach shown in FIG. 6, the weighting factor Wi could be applied to a matching score selected from the best of SIHMMr and SDHMMT scores. The weighting factor W2 could be applied to a matching score selected from the best of SIHMMNT and SDHMMNT scores. [1047] Thus, disclosed herein is a VR method and apparatus utilizing a combination of SI and SD acoustic models for improved VR performance during unsupervised training and testing. Those of skill in the art would understand that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or p articles, or any combination thereof. Also, though the embodiments are describee primarily in terms of Dynamic Time Warping (DTW) or Hidden Markov Model (KMM) acoustic models, the described techniques may be applied to other types of acoustic models such as neural network acoustic models. ['1048] Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the rescribed functionality in varying ways for each particular application, but such .implementation decisions should not be interpreted as causing a departure from tie scope of the present invention. [1049} The various illustrative logical blocks, modules, and circuits described h connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FF'GA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration, 11050] Tie steps of a method or algorithm Described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form .of storage medium known-in the art. An exemplary storage medium is coupled to the processor such the processor can read -information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. In the alternative, tho processor and the storage medium may reside as discrete components in a user terminal [11)51] The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make cr use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein. [1052] WHAT IS CLAIMED IS: CLAIMS 1. A voice recognition appiaratus comprising: j^ speaker independent acoustic model a speaker dependent acoustic model; a voice recognition engine; and s: computer readable madia embodying a method tor performing unsupervised voice recognition training and testing, the method comprising performing pattern matching of input speech with the contents of said speaker independent acoustic model to produce speaker independent pattern matching scores, comparing the speaker independent pattern matching scores with scores associated with tenriplates; scored in said speaker depender\t acoustic rrodel. and updating at least one template in said speaker dependent acousMc nr.odel based on the results of the comparing. 2. rhe voice recognition apparatus of claim 1,^wiierein said speaker iridependant acoustic model comprises at 'eatit one hidden markov model (HMM) acoustic model. 3. Tha voice recognition apparatus of claim 1, wherein said speaker independent acoustic model comprises at Iea^?.t one dynamic time warping piW) acoustic model. A. The voice recognition apparatus of claim 1. wherein said speaker independent acoustic model com.prises at least one hidden markov model (HMM) acoustic model and at least one dynamic time warping (DTW) acoustic model- il Tb& voice recognition apparatus of claim 1, wherein said speaker independent acoustic model: includes at least one garbage template, v^^hereln said comparing includes comparing the input speech to the at least one garbage template. 6. The voice recognition apparatus of claim 1, wherein said speaker def)endent acoustic model compr'^ses at lea&t one dynamic time warping (DTW) acoustic model. 7- A voice recognition apparatus comprising; a Sipeaker independent acoustic model a S\|:»aaker dependent acotiStic njodel; a voice recognition engine; and a computer readable media embodying a method for performing unsupeivised voice recognition training and testing, the method comprising performing pattern matching of a first input speech segment with the contents of . seid speaker independent acoustic model to produce speaker independent pjrtttern matching scores, comparing the speaker independent pattern matching scores w\h scores associated with templates storud in said speaker dependent acoustic modal, updating at least one template in said speaker dependent awustic model based on the results of the comparing, configuring said voic;e re^cognition engine to compare a second input speech segment with the contenti5 of said speaker independertt acoustc model and said speaker dapendent acoustic model to generate at least one combined speaker dspendent and speaker independent matching score, and identifying an utterance class having the best combined ^peak^r dependent and speaker independent matching score. a. The voice recognition apparatus of claim 7, wherein said speaker independent acoustic model con:^piise3 at least one hidden markov model (HMM) acoustic model f. The voice recognition apparatus o1 ciaim 7, wt^erein said speaker independent acousiic model comprises st least one dynamic time warping (DTW) aoouQ^c model 10. The voice recognition apparatus of claim 7, wherein said speaker independent acoustic model comprises at least one hidden markov model HMM) acoiastic modal and at least one dynamic time warping (DTVV) acoubtic lode!. 1. ThD voice rDCognition apparatus of ciairn 7, wherein said speaker lepsndeni acoustic model comprises at least one dynamic time warping (OTW) icoustic nu:idel. 12. A voice recognition appamtus comprising: a speaktir independent acoustic model a speaker dependent acoustic model; a voice recoonltion engine for performing pattern matching of input spfiech with the contents of said speaker independent acoustic model to prciduce speaker independent pattern matching scores and for performing p£i;tern matching of the input speech with the contents of said speaker doDendent acoustic model to produce speaker dependent pattern matching SD:)res. and for generating combined matching scores for a plurality of utterance classes based on the speaker independent pattern matching scores and the speaker dependent pattern matching scores, 13. The voice recognition apparatus of claim 7, wherein said speaker independent acoustic model comprises at least one hidden markov model (hMM) acoustic model. 14. The voice recognition apparatus of claim 7, wherein said speaker indepencent acoustic model comprises at least one dynamic time warping (OTW) acoustic model. 15. The voice recognition apparatus of claim 7, wherein said speaker indeperjdent acoustic model comprises at least one hidden markov rnodei (HMM) acoustic model and at least one dynamic time warping (DTW) acoustic model- 16. The voice recognition apparatus of claim 7, wherein said speaker jepondent acotistic model comprises at least om dynamic time warping (DTW) acoustic rncKlel. 17. A (ftelhod for performina vo^ce recognition comprising: perfaiTTung pattern matching of a first input speech segment with at least one speaker independent acoustic tempiate to produce at least one input pattern matching score; comparing the at least one input pattern matching score with a stored score associated witi"» a stored acoustic template; and replacing the stored acoustic template based on the results of said ■ comparing. 18. Tha method of claim 17 wherein said performing pattern matching further cornpris pedonnning hidden markov mode! (HtvlM) pattern matching of the first -input speech segment with at laast one HMM template to generate at least one HWM matching score; performing dynamic time warping (DTW) pattern matching of the first input speech segment with at least one DTV/ template to generate at least one DTW matching score; and performing at least one weighted sum of said at least one HMM matching soDre and said at least one DTV/ matching score to genf^rate said st least one input pattern matching score. 19. The method of claim 17 further comprising: pwrtorming pattern matching of a second input speech segrr^ent with at least one speaker independent acoustic template to generate at least one speaker Independent matching score; perfomiing pattern matching of the second input speech segment with tfia stored acoustic template to generate a speaker dependent matching score; and combining the at least on»5 ^ipeakor independent matching scora with the speaker clep 20. The /nethod of claim 19 further comprising identifying an utterance class associated with the best of the at least one combined matching score. 21. A niethoj for performing voice recognition comprising: perfornrVmg pattern matching of an input speech segment with at least on€ speaker independent acoLstic template to generate at least one speaker independent matching score; performing pattern matctung of the input speech segn^ent with a speaker dependent acoustic template tc generate at least one speaker dependent ntaSching score; and combining the at least one speaker independent matching score with the at least one speaker dependent matching score to generate at least one combined matching score. 22. A method for oerforming voice recognition comprising; comparing a set of input acoustic feature vectors wiih a speaker independent template in a speaker independent acoustic model to generate a speaker independent pattern matching score, wherein said speaker incependent template Is associated with a first utterance class; con-jparing the set of Input acoustic feature vectors with at least one speaker dependent template in a speaker dependent acoustic model to generate a speaker dependent pattern matching score, wherein said speaker dependent template is associated with said first utterance class; combining said speaker independent pattern matc:hing score with said speaker dependent pattern matching scores to produce a combined pattern matching score; and comparing said combined pattern matching score with at least one other ccrnbined pattern matching score associated with a second utterance class. ?.3, An apparatus for performing voice recognition comprising: means for performing pattern matching of a first input speech segment with at least: one speaker independent acoustic template to produce at least one input pattern matching score; means for comparing the at least one input pattenn matching score wittj a storod score associated with a stored acoustic template; and meanj^ for replacing the stored acoustic template based on the results of said compisring- 24. An apparatus for performing voice recognition comprising: means for perfonning pattern matching of an input speech segment with at teast one speai spociker independent matching score; means for pe/forming pattern matching of the input speech segment with a speaker dependent acoustic template to generate at least one speaker dep means for combining the at least one speaker independent matching score with the at least one speaker dependent matching score to generate at least one combined matching score A voice recognition apparatus substantially as herein described with reference to the accompanying drawings. A method for performing voice recognition substantially as herein described with reference to the accompanying drawings.

Full Text

VOICE RECOGNITION SYSTEM USING IMPLICIT
SPEAKER ADAPTATI ON
BACKGROUND
Field
[1001] The present invention relates to speech signal processing. More
particularly, the present invention relates to a navel voice recognition method
and apparatus for achieving improved performance through unsupervised
training.
Background
[1002] Voice recognition represents one of tb« mast important techniques to endow a machine with simulated intelligence to'recognize user voiced commands and to facilitate human interface with the. machine. Systems that employ techniques ic recover a linguistic: message from an acoustic speech signal are called voice recognition (VR) systems. FIG, 1 shows a basic VR system having a preemphasis filter 102, an acoustic feature extraction (AFE) unit 104, and a pattern marching engine 110. Ths AFE unit 104 converts H series of digital voice samples into a set of measurement values (for example, extracted frequency components) called an acoustic feature vector The pattern matching engine 110 matches a series of accustfc feature vectors with the templates contained in a VR acoustic mode! 112. VR pattern matching engines generally employ either Dynamic Time Warping (DTW) or Hidden Markov Mode! (HMVI) techniques. Both DTW and HMM are well known in the art, and are described in detail in Rabiner, L R. and Juang, B. H-, FUNDAMENTALS OF SPEECH RECOGNITION, Prentice Hall, 1993. When a series of acoustic features matches a template in the acoustic model 112, the identified template is used to generate a desired format of output, such as an identified sequence of linguistic words corresponding to input* speech.
[1003] As: noted above, the acoustic modei 112 is generally either a HMM mode! or a DTW model. A DTW-acoustic model may be thought of as a

date bast? of templates associated with each of trie words that need to be recognizee! In general; a DTW template consists of a sequence of feature vectors that has been averaged over many examples of the associated word. DTW pattern matching generally involves locating a stored template that has minimal distance to the input feature vector sequence representing input speech, A template used in an HMM based acoustic mode! contains a detailed statistical description of the associated speech utterance. In general, a HMW template stores a sequence of mean vectors, variance vectors md a set of transition probabilities. These parameters are used to describe the statistics of a speech unit and- are estimated from many examples of the speech unit HMM pattern matching generally involves generating a probability for each template in the model based on the series of input feature vectors associated with the input speech. The template having the highest probability is selected as the most likely input utterance.
[10(14] Training" refers to the process of collecting speech samples of a particular speech segment or syllable from one or more speakers in order to gerejrate "templates in the "acoustic model 112. Each template in the acoustic model is associated with a particular word 01 speech segmant called an utterance class. There may be multiple templates In the acoustic model associated uith the same utterance class. Testing* refers to the procedure for matching he templates in the acoustic model to a sequence of feature vectors extracted from input speech. The performance of a given system depends largely upon the degree of match between the input speech of the end-user and the contents, of the database, and hence on the match between the reference templates created through training and the speech samples used for VR testing. [1005] The two common types of training are supervised training and unsupervised training- In supervised training, the utterance class associated with each set of training feature vectors is known a priori. The speaker providing the input speech is often provided with a script of words or speech segnents corresponding to the predetermined utterance classes The feature vectors resulting from the reading of the script may then be incorporated into the acoustic model templates associated with the correct utterance classes. [1006] In unsupervised tralnlngi the utterance class associated with a set of training feature vectors is not known a priori. The utterance class must be

correctly Identified before a set of training feature vectors can be incorporated into the correct acoustic model template. In unsupervised training, a mistake in icentlfying the utterance class for a set of training feature vectors can iead to a modifier;ion in the wrong acojstic model template. Such a mistake generally degrades, rather than improves, speech recognition performance. In order to avoid such a mistake, any modification of an acoustic model based on unsupervised training must generally be dene ver/ conservatively, A set of training feature vectors is incorporated into the acoustic model only if there is relatively high confidence that the utterance class has been correctly identified. Such necessary conservatism makes building an SD acoustic model through unsupervised training a very slow process. Until tine SD acoustic model is built in :his way, VR performance wll probably be unacceptable to most users. [1007] Optimally, the end-user provides speech acoustic feature vectors during both training and testing, so that the acoustic model 112 will match strongly with the speech of the end-user. An individualized acoustic model that., is ;ai!ored to a single speaker is also called a speaker dependent (SD) acoustte model. Generating an SD acoustic model generally requires the end-user to provide a large amount of supervised training samples. First, the user must provide training samples for a large variety of utterance classes. Also, in order to achieve the best performance, the end-user must provide multiple templates representing a variety of possible acoustic environments for each utterance class. Because most users are unable or unw King to provide the input speech necessary to generate an 3D acoustic modal, many existing VR systems instead use generalized acoustic models that are trained using the speech of msny "representative" speakers. Such acoustic models are referred to as speaker independent (SI) acoustic models, and are designed to have the best performance over a broad range of users. SI acoustic models, however, may no1.: be optimized to any single user. A VR system that uses an SI acoustic rncde! will not perform as well for a specific user as a VR system that uses an $C acouattc model tailored to that user. For some users, such as those having a strong foreign accents, the performance of a VR system using an SI acoustic model can be so poor that they cannot effectively use VR services at all. [1Q08] Optimally, an SO acoustic model would be generated for each individual user. As discussed above, building SD acoustic models using

supervisee! training is impractical. But using unsupervised training to generate a SD acoustic model can take a long time, during which VR performance based on a partial SD acoustic model may be very poor- There is a need in the art for a VR system that performs reasonably well before and during the generation of an SD acoustic model using unsupervised training.
SUMMARY
[1009] The methods and apparatus disclosed herein are directed to a novel and improved voice recognition (VR) system that utilizes a combination of speaker independent (SI) and speaker dependent (SD) acoustic models. At least one SI acoustic mode,: is used in combination with at ieast one SD acoustic -nodel to provide a level of speech recognition performance that at least egja;s that of a purely SI acoustic model. The disclosed hybrid SI/SO VR system continually uses unsupervised training to update the acoustic templates in the one" or1'more SD acoustic models. The hybrid VR system then uses the updated 3D acoustic models, alone or in combination with the at least one SI acoustic model, to provide improved VR performance during VR testing. [1D10] The word "exemplar/1 is used hsrern to mean "serving as an example, instance, or illustration.n Any embodiment described as an Exemplary embodiment" is not necessarily to be construed as being preferred or advantageous over another embodiment.
BRIEF DESCRIPTION OF THE DRAWINGS
[1011] The features, objects, ana advantages of the presently disclosed
method and apparatus wit* become more apparent from the detailed description
set forth below when taken in conjunction witn the drawings in which like
reference characters identify correspondingly throughout and wherein:
[1012] FIG. 1 shows a basic voice recognition system;
[1013] FIG. 2 shows a voice recognition system according to an exemplary
embodiment;
[1014] FfG, 3 shows a method for performing unsupervised training.

[1015] FIG. 4 shows an exemplary approach to generating a combined
matching score used in unsupeivised training,
[1016] FIG- 5 is a flowchart showing a method for performing voice
recognition (testing) using both speaker independent (SI) and speaker
dependent (SD) matching scores;
[1017] FiG* 6 shows an approach to generating a combined matching score
from both speaker independent (SI) and speaker dependent (SD) matching
scores; and
DETAILED DESCRIPTION
[1018] FIG. 2 shows an exemplary embodiment of a hybrid voice recognition (VR) system as might be implemented within a wireless remote station 202. In an exemplary embodiment, the remote station 202 communicates through a wireless channel (not shown) with a wireless communication network (not shewn), -or example, the remote station 202 may be a wireless phone communicating with a wireless phone system. One skilled in the art will recognize that the techniques described herein may be equally applied to a VR system that is fixed (non-portable) or does not involve a wireless channel. [1019] In the embodiment shown, voice signals from a user are converted intc electrical signals in a microphone (MIC) 210 and converted into digital speech samples in an analog-to-digital converter (ADC) 212. The digital sample stream is then filtered using a preemphasis (PE) filter 214, for example a finite impulse response (FIR) filter that attenuates low-frequency signal components-[1020] The filtered samples are then analyzed in an acoustic feature extraction (AFE) unit 216. The AFE unit 216 converts digital voice samples into accustic feature vectors. In an exemplary embodiment, the AFE unit 216 perforins a Fourier Transform on a segment of consecutive digital samples to generate a vector of signal strengths corresponding to different frequency bins. In an exemplary embodiment, the frequency bins have varying bandwidihs in accordance with a bark scale, In a bark scale, the bandwidth of each frequency bin bears a relation to the center frequency of the bin, such that higher-frequency bins have wider frequency bands than lower-frequency bins. The

bark scale is described in Rabiner, L. R. and Juang, B. H., FUNDAMENTALS OF SPEECH RECOGNITION, Prentice Hall, 1993 and is well known in the art. [1021] In an exemplary embodiment, each acoustic feature vector is extracted from a series of speech samples collected over a fixed time interval. In an exemplary embodiment, these time intervals overlap. For example, acoustic features may be obtained from 20-mlllisecond intervals of speech data beginning evsry ten milliseconds, such that each two consecutive intervals snare a 10-millisecond segment One skilled in the art would recognize that, the tine intervals might instead be non-overlapping or have non-fixed duration without departing from the scope of the embodiments described herein. [1022] The acoustic feature vectors generated by the AFE unit 218 are provided to a VR engine 220, which performs pattern matching to characterize the acoustic feature vector based on the contents of one or more acoustic models 230, 232, and 234.
[1023] In the exemplary-embodiment shown in FIG. 2, three acoustic models are shown: a speaker-independent (SI) Hidden Markov Model (HMM) model • 230, a speaker-independent Dynamic Time Warping (DTW) model 232, and a speaker-dependent (SO) acoustic model 234. One skilled in the art will recognize that different combinations of SI acoustic models may be used in alternate embodiments. For example, a remote station 202 might include just the SiHMM acoustic model 230 and the SD acoustic model 234 and omit the SIDTW acoustic model 232. Alternatively, a remote station 202 might include a single SiHMM acoustic model 230, a SD acoustic model 234 and two different SIDTW acoustic models 232. In addition, one skilled In the ail will recognize that the SD acoustic model 234 may be of the HMM type or the DTW type or a combination of the two. In an exemplary embodiment, the SD acoustic model 234 is a DTW acoustic model.
["024] As described above, the VR engine 220 performs pattern matching to ■dstermine the degree of matching between the acoustic feature vectors and the contents of one or more acoustic models 230, 232t and 234. In an exemplary embodiment, the VR engine 220 generates matching scores based on matching acoustic feature vectors with the different acoustic templates in each of the acoustic models 230,232, and 234. For example, the VR engine 220 generates HMM matching scores based on matching a set of acoustic feature

matching :he acoustic feature vectors with multiple DTW templates in the SIDTW acoustic model 232. The VR engine 220 generates matching scores b&sed on matching the acoustic feature vectors with the templates in the SD acoustic model 234
[1325] As described above, each template in an acoustic model is associated with an utterance class. In an exemplary embodiment, the VR er-gine 220 combines scores for templates associated with the same utterance c:ass to create a combined matching score to be used in unsupervised training. For example, the VR engine 220 combines S1HMM and SIDTW scores obtained from correlating an input set of acoustic feature vectors to generate a combined SI score. Based on that combined matching score, the VR engine 220 determines whether to store the input set of acoustic feature vectors as a 3D template in the SD acoustic model ?34 In an exemplary embodiment, unsupervised training to update the SD acoustic model 234 is performed using exclusively SI matching scores. This prevents additive errors that might otherwise result from using an evolving SD acoustic model 234 for unsupervised training of itself. An exemplary method of performing this unsupervised training is described In greater detail below. [1926] In addition to unsupervised training, the VR engine 220 uses the vsrious acoustic models (230, 232, 234) during testing. In an exemplary embodiment, the VR engine 220 retrieves matching scores from the acoustic models (230, 232, 234) and generates combined matching scores for each utterance class. The combined matching scores are used to select the utierance class that best matches the input speech. The VR engine 220 groups consecutive utterance classes together as necessary to recognize whole words or phrases. The VR engine 220 then provides information about the recognized word o* phrase to a control processor 222, which uses the information to determine the appropriate response to the speech information or command. For example, in response to the recognized word or phrase, the control processor 222 may provide feedback to the user through a display or other user interface. In another example, the control processor 222 may send a message through a wireless modern 218 and an antenna 224 to a wireless network (not

shown), initiating a mobile phone call to a destination phone number associated with the person whose name was uttered and recognized, [1027] The wireless modem 218 may transmit signals through any of a variety of wireless channel types including CDMA, TDMA, or FDMA. In adcition, the wireless modem 218 may be replaced with other types of communications interfaces that communicate over a non-wireless channel without departing from the scope of the described embodiments. For example, the remote station 202 may transmit signaling information through any of a variety of types of communications channel including land-line modems, T1/E1, ISDN, DSL, ethernet, or even traces on a printed circuit board (PCB). [1028] FIG. 3 is a flowchart showing an exemplary method for performing unsupervised training. At step 302, analog speech data is sampled in an analog-to-digital converter (ADC) (212 in FIG. 2). The digital sample stream is then filtered at step 304 using a preemphasis (PE) filter (214 in FIG. 2). At step 305, input acoustic feature vectors are extracted/from the filtered samples in an acoustic feature extraction (AF'E) unit (216 in FIG. 2), The VR engine (220 in FIG. 2) receives the input acoustic feature vectors from the AFE unit 216 and performs pattern matching o? the input acoustic feature vectors against the contents; cf the SI acoustic models (230 and 232 in FIG. 2). At step 308, the VR engine 220 generates matching scores from ths results of the pattern matching. The VR engine 22fl generates SIHMM matching scores by matching the input acoustic feature vectors with the SIHMM acoustic model 230, and generates SIDTW matching scores by matching the input acoustic feature vectors wiih the SIDTW acoustic model 232. Each acoustic template in the SIHMM and SIDTW acoustic models (230 and 232) is associated with a particular utterance class. A: step 310, SIHMM and SIDTW scores are combined to form combined matching scores.
[1029] FIG. 4 shows the generation of combined matching scores for use in unsupervised training, in the exemplary embodiment shown, the speaker independent combined matching score SCOMBRI for a particular utterance class is a weighted sum according to EQN. 1 as shown, where:
SIHMMr is tne SIHMM matching score for the target utterance class;

SIHMMNT is the next best matching score for a template in the SIHMM acoustic model that is associated with a non-target utterance class (an utterance ctess other than the target utterance class); SIHMMe is the SIHMM matching score For the "garbage" utterance class; SlDTWr is the SIDTW matching score for the target utterance class; SIDTWNT is the next best matching score for a template in the SIDTW acoustic model that is associated with a non-target utterance class; and SIDTWo is the SIDTW matching score for tie "garbage" utterance ciass. [1030] The various individual matching scores SIHMMn and SIDTWn may be viewed as representing a distance value between a series of input acoustic feature vectors and a template In the acoustic model. The greater the distance between the input acoustic feature vectors and a template, the greater the matching score. A close match between a template and the input acoustic feature vectors yields a very low matching score* If comparing a series of input acoustic feature vectors to two templates associated with [1032] Before the VR system can confidently recognise an utterance class a:> the "correct" one, the input acoustic feature vectors should have a higher .degree of matching with templates associated with that utterance class than with garbage templates or templates associated other utterance classes. Combined matching scores generated from a variety of acoustic models can rrore confidently discriminate between utterance classes than matching scores based on only one acoustic model. In an exemplary embodiment the VR system uses such combination matching scores to determine whether to

replace a template in the SD acoustic model (234 in FIG. 2) with one derived from a new set of input acoustic feature vectors.
[1033] The weighting factors {Wi .. - W$) are selected to provide the best training performance over all acoustic environments. In an exemplary embodiment, the weighting factors (Wi . . . W6) are constant for all utterance classes. In other words, the Wn used to create the combined matching score for a first target utterance class is the same as the Wn value used to create the combined matching score for another target utterance class. In an alternate embodiment, the weighting factors vary based on the target utterance class. Other ways of combining shown in FIG. 4 will be obvious to one skilled in the ait, and are to be viewed as within the scope of the embodiments described herein. For example, more than six or less than six weighted inputs may also bo used. Another obvious variation would he to generate a combined matching score based on one type of acoustic model. For example, a combined matching score could be-generated based on SIHMMT, SIHMMMT, and SIHMMQ. •Or, a combined matching score could be generated based on SIDTWT, SIDTWNT, and SIDTW3.
[1034] In an exemplary embodiment, Wi and W4 are negative numbers, and a greater (or less negative) value of Scows indicates a greater degree of matching (smaller distance) between a target utterance class and a series of input acoustic feature vectors. One of skill in the art will appreciate that the signs a: the weighting factors may easily be rearranged such that a greater d-sgree of matching corresponds to a lesser value without departing from the scope of Hie disclosed embodiments.
[1035] Turning back to FIG- 3, at step 310, combined matching scores are generated for utterance classes associated with templates in the HMM and DTW acoustic models (230 and 232). In an exemplary embodiment, combined matching scores are generated only for utterance classes associated with the Dsst n SIHMM matching scores and for utterance classes associated with the bast m SIDTW matching scores. This limit may be desirable to conserve computing resources, even though a much larger amount of computing power is consumed while generating the individual matching scores. For example, if n=m=3I combined matching scores are generated for the utterance classes associated with the top three SIHMM and utterance classes associated with the

to 3 three siuiw matching scores. Depending on whether the utterance classes associated with the top three SiHMM matching scores are the same as th* utterance classes associated with the top three SIDTW matching scores, this approach will produce three to six different combined matching scores. [1-336] At step 312, the remote station 202 compares the combined matching scores with the combined matching scores stored with corresponding templates (associated with the same utterance class) in the SD acoustic model. If the new series of input acoustic feature vectors has a greater degree of matching than that of an older template stored in the SD model for the same utterance class, then a new SD template is generated from the new series of input acoustic feature vectors. In an embodiment wherein a SD acoustic model is a DTW acoustic model, the series of input acoustic vectors itself constitutes the new SD template. Ths older template is then replaced with the new template, and the combined matching score associated with the new template is stored in the SD acoustic model to be used in future comparisons.
|1037] ' In an alternate embodiment, unsupervised training is used to update ore or mere templates in a speaker dependent hidden markov model (SDHMM) acoustic model. This SDHMM acoustic model could be used either in place of an SDDTW model or in addition to an SDDTW acoustic model within the SD acoustic mode! 234.
[1038] In an exemplary embodiment, the comparison at step 312 also includes comparing the combined matching score of a prospective new SD te-nplate with a constant training threshold. Even if there has not yet been any te-npfate stored in a SD acoustic model for a particular utterance class, a new template will not be stored In the SD acoustic model unless it has a combined matching score that Is better (indicative of a greater degree of matching) than th«3 training threshold value.
[1339] In an alternate embodiment, before any templates in the SD acoustic model have been replaced, the SD acoustic model is populated by default with templates from the SI acoustic model. Such an initialization provides an alternate approach to ensuring that VR performance using the SD acoustic model will start out at least as good as VR performance using just the S! acoustic model. As more and more of the templates irrthe SD acoustic model

are jpdatscl, the VR performance using the. SD acoustic moaei win surpass VK performance using just the SI acoustic model.
[1040] in an alternate embodiment, the VR system allows a user to perform supervised raining. The user must put the VR. system into * supervised training mode before performing such supervised training. During supervised training, the VR system has a priori knowledge of the correct utterance class. If the combined matching score for ths input speech is better than the combined matching score for the 3D template previously stored for that utterance class, the:i the ir.put speech is used io form a replacement SD template. In an alternate embodiment, the VR system allows the user to force replacement of existing SD templates during supervised training.
[1041] The 3D acoustic model may be designed with room for multiple (two or more) templates for a single utterance class. \r an exemplary embodiment, two templates are stored in the SD acoustic mode! for each utterance class. Th«* comparison at stap 312 therefore entails comparing the matching score obtained with a new tempiats with the matching scores obtained for both templates in the SD acoustic mode! for the same utterance daiss. If the new lenpiate has a better matching score than sither older template in the SD acoustic model, then at step 314 the SD acoustic model template having xhe wcrst matching scoie is replaced with the new template. If the matching score of the ne.w template is no belter than either oider template, then step 314 is skpped. Additionally, at step 312, the matching score obtained with the new template is compared against a matching score threshold. So, until new templates naving a retching score that is belter than the threshold are stored in th* SD sr:oustic (t\cdel the ns-w templates are compared against this threshold value before they will be used to overwrite the prior contents of the SD acoustic model. Obvious variations, such as storing the SD acoustic model templates in sorted order according to combined matching score and comparing new matching scores only with the lowest, ars anticipated and are TO be considered within the scope of the embodiments disclosed herein. Obvious variations on cumbers of templates arored in tho acoustic mode;! for each utterance class are &lso anticipated. For example, the SD acoustic mode! may contain more than two templates for sach utterance class, or may contain different numbers of templates for different utterance classes.

[1042] FIG- 5 is a flowchart showing an exemplary method for performing VR testing using a combination of SI and 3D acoustic models. Steps 302, 304, 30Ci, and 308 are the same as described for FIG* 3. The exemplary method diverge; Ircm the method shown in FIG. 3 at step 510. At step 51 G> the VR engine 220 generates SD matching scores basad on comparing the input acoustic feature vectors with templates in the SD acoustic modei. In an exsmpiary embodiment, SD matching scores are generated only for utterance classes associated with the best n SIHMM matching scores and the best rn SIDTW matching scores. In an exemplary embodiment, n-m-3. Depending on the degree of overlap between the two sets of utterance classes, this will result in generation of SD matching scores for three to six utterance classes. As discussed above, the SD acoustic model may contain multiple templates for a single utterance class. At step 512, the VR engine 220 generates hybrid conbined matching scares for use in VR testing. In an exemplary embodiment, thtsse hybrid combined matching scores are based on both individual SI and ind vidua! 3D matching scores. At step 514, the word or utterance having the beiit combined matching score is selected and compared against a testing threshold. An utterance is on'y deemed recognized if its combined matching sec re exceeds this resting threshold. In an exemplary embodiment, the weights [WM ... Wt5] used to generate combined scores for training (as shown in FIG. 4) are equal ;o the weights [W^ . . . V\fe] used to generate combined scores for testing (as shown in FIG, 6), but ihe training threshold is not equal to the testing
threshold.
[1043] FIG- 6 shows the generation of hybrid combined matching scores performed at step 5^2. The exemplary embodiment shown operates identically to the combiner shown in FIG, 4, except that the weighting factor W.* is applied to OTWT instead of SiDTWr ar»c the weighting factor Ws is applied to DTWNT instead of SIDTVVNT. DTWT (the dynamic time warping matching score for ine target utterance class) is selected from the best of the SIDTW and SDDTW scores associated with the target utterance class. Similarly, DTWNT (the dynamic tins warping matching score for the remaining non-target utterance classes) is selected from the best of the SiDTW and SQD7W scores associated witn non-target utterance classes.

[1044] The SI/SD hybrid score SCOMB_H for a particular utterance class is a weighted sum according to EQN. 2 as shown, where SIHMM-r, SIHMMN-r, SiHMMe, and SIDTW© are the same as in EQN. 1. Specifically, in EQN. 2:
SIHMMT is the SIHMM matching score for the target utterance class;
SIHMMNT is the next best matching score for a template in the SIHMM
acoustic model that is associated with a non-target utterance class (an
utterance class other than the target utterance class);
SIHMMG is the SIHMM matching score for the "garbage" utterance class;
DTWT is the best DTW matching score for SI and SD templates
corresponding to the target utterance class;
DTWNT is the best DTW matching score for SI and SD templates
corresponding to non-target utterance classes; and
SIDTWG is the SI DTW matching score for the "garbage" utterance class. Thus, the Sl/SD hybrid score SCOMB__W is a combination of individual S! and SD matching scores. The resulting combination matching score does not rely entirely :>n either SI or 3D acoustic models If the matching score S!DTWT is better than any SDDTWT score, then the Sl/SD hybrid score is computed from ;he better SIDTWr score. Similarly, if the matching score SDDTWT is better han any SIDTWT score, then the SI/SD hybrid score is computed from the oetter SDDTWT score. As a result, if the templates in the SD acoustic model yield poor matching scores, the VR system may stilt recognize the input speech oasec cr\ ths SI portions of the SI/SD hybrid scores. Such poor SD matching scores might have a variety of causes including differences between acoustic environments during training and testing or perhaps poor quality input used for training
[1045] In an alternate embodiment, the SI scores are weighted less heavily than the SD scores, or may even be ignored entirely. For example. DTWy is selected from the best of the SDDTW scores associated with the target • utterance class, ignoring the SIDTW scores for the target utterance class. Also, DTWNT may be selected frcm the best of either the SIDTW or SDDTW scores associated wit!) non-target utterance classes, instead of using both sets of scores.
[1046] Though the exemplary embodiment is described using only SDDTW acoustic models for speaker dependent modeling, the hybrid approach

described herein is equally applicable to a VK system using SUMMM acoustic models or even a combination of SDDTW and SDHMM acoustic models. For e>ample, by modifying the approach shown in FIG. 6, the weighting factor Wi could be applied to a matching score selected from the best of SIHMMr and SDHMMT scores. The weighting factor W2 could be applied to a matching score selected from the best of SIHMMNT and SDHMMNT scores. [1047] Thus, disclosed herein is a VR method and apparatus utilizing a combination of SI and SD acoustic models for improved VR performance during unsupervised training and testing. Those of skill in the art would understand that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or p articles, or any combination thereof. Also, though the embodiments are describee primarily in terms of Dynamic Time Warping (DTW) or Hidden Markov Model (KMM) acoustic models, the described techniques may be applied to other types of acoustic models such as neural network acoustic models. ['1048] Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the rescribed functionality in varying ways for each particular application, but such .implementation decisions should not be interpreted as causing a departure from tie scope of the present invention.
[1049} The various illustrative logical blocks, modules, and circuits described h connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array

(FF'GA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such
configuration,
11050] Tie steps of a method or algorithm Described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form .of storage medium known-in the art. An exemplary storage medium is coupled to the processor such the processor can read -information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. In the alternative, tho processor and the storage medium may reside as discrete components in a user terminal
[11)51] The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make cr use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein. [1052] WHAT IS CLAIMED IS:

CLAIMS
1. A voice recognition appiaratus comprising:
j^ speaker independent acoustic model
a speaker dependent acoustic model;
a voice recognition engine; and
s: computer readable madia embodying a method tor performing unsupervised voice recognition training and testing, the method comprising performing pattern matching of input speech with the contents of said speaker independent acoustic model to produce speaker independent pattern matching scores, comparing the speaker independent pattern matching scores with scores associated with tenriplates; scored in said speaker depender\t acoustic rrodel. and updating at least one template in said speaker dependent acousMc nr.odel based on the results of the comparing.
2. rhe voice recognition apparatus of claim 1,^wiierein said speaker iridependant acoustic model comprises at 'eatit one hidden markov model (HMM) acoustic model.
3. Tha voice recognition apparatus of claim 1, wherein said speaker independent acoustic model comprises at Iea^?.t one dynamic time warping piW) acoustic model.
A. The voice recognition apparatus of claim 1. wherein said speaker independent acoustic model com.prises at least one hidden markov model (HMM) acoustic model and at least one dynamic time warping (DTW) acoustic model-
il Tb& voice recognition apparatus of claim 1, wherein said speaker independent acoustic model: includes at least one garbage template, v^^hereln said comparing includes comparing the input speech to the at least one garbage template.

6. The voice recognition apparatus of claim 1, wherein said speaker def)endent acoustic model compr'^ses at lea&t one dynamic time warping (DTW) acoustic model.
7- A voice recognition apparatus comprising;
a Sipeaker independent acoustic model
a S|:»aaker dependent acotiStic njodel;
a voice recognition engine; and
a computer readable media embodying a method for performing unsupeivised voice recognition training and testing, the method comprising performing pattern matching of a first input speech segment with the contents of . seid speaker independent acoustic model to produce speaker independent pjrtttern matching scores, comparing the speaker independent pattern matching scores w\h scores associated with templates storud in said speaker dependent acoustic modal, updating at least one template in said speaker dependent awustic model based on the results of the comparing, configuring said voic;e re^cognition engine to compare a second input speech segment with the contenti5 of said speaker independertt acoustc model and said speaker dapendent acoustic model to generate at least one combined speaker dspendent and speaker independent matching score, and identifying an utterance class having the best combined ^peak^r dependent and speaker independent matching score.
a. The voice recognition apparatus of claim 7, wherein said speaker independent acoustic model con:^piise3 at least one hidden markov model (HMM) acoustic model
f*. The voice recognition apparatus o1 ciaim 7, wt^erein said speaker independent acousiic model comprises st least one dynamic time warping (DTW) aoouQ^c model
10. The voice recognition apparatus of claim 7, wherein said speaker independent acoustic model comprises at least one hidden markov model

HMM) acoiastic modal and at least one dynamic time warping (DTVV) acoubtic lode!.
1. ThD voice rDCognition apparatus of ciairn 7, wherein said speaker lepsndeni acoustic model comprises at least one dynamic time warping (OTW) icoustic nu:idel.
12. A voice recognition appamtus comprising:
a speaktir independent acoustic model
a speaker dependent acoustic model;
a voice recoonltion engine for performing pattern matching of input spfiech with the contents of said speaker independent acoustic model to prciduce speaker independent pattern matching scores and for performing p£i;tern matching of the input speech with the contents of said speaker doDendent acoustic model to produce speaker dependent pattern matching SD:)res. and for generating combined matching scores for a plurality of utterance classes based on the speaker independent pattern matching scores and the speaker dependent pattern matching scores,
13. The voice recognition apparatus of claim 7, wherein said speaker independent acoustic model comprises at least one hidden markov model (hMM) acoustic model.
14. The voice recognition apparatus of claim 7, wherein said speaker indepencent acoustic model comprises at least one dynamic time warping (OTW) acoustic model.
15. The voice recognition apparatus of claim 7, wherein said speaker indeperjdent acoustic model comprises at least one hidden markov rnodei (HMM) acoustic model and at least one dynamic time warping (DTW) acoustic model-

16. The voice recognition apparatus of claim 7, wherein said speaker jepondent acotistic model comprises at least om dynamic time warping (DTW) acoustic rncKlel.
17. A (ftelhod for performina vo^ce recognition comprising:
perfaiTTung pattern matching of a first input speech segment with at least one speaker independent acoustic tempiate to produce at least one input pattern matching score;
comparing the at least one input pattern matching score with a stored score associated witi"» a stored acoustic template; and
replacing the stored acoustic template based on the results of said ■ comparing.
18. Tha method of claim 17 wherein said performing pattern matching further
cornpris pedonnning hidden markov mode! (HtvlM) pattern matching of the first -input speech segment with at laast one HMM template to generate at least one HWM matching score;
performing dynamic time warping (DTW) pattern matching of the first input speech segment with at least one DTV/ template to generate at least one DTW matching score; and
performing at least one weighted sum of said at least one HMM matching soDre and said at least one DTV/ matching score to genf^rate said st least one input pattern matching score.
19. The method of claim 17 further comprising:
pwrtorming pattern matching of a second input speech segrr^ent with at least one speaker independent acoustic template to generate at least one speaker Independent matching score;
perfomiing pattern matching of the second input speech segment with tfia stored acoustic template to generate a speaker dependent matching score; and

combining the at least on»5 ^ipeakor independent matching scora with the speaker clep 20. The /nethod of claim 19 further comprising identifying an utterance class associated with the best of the at least one combined matching score.
21. A niethoj for performing voice recognition comprising:
perfornrVmg pattern matching of an input speech segment with at least on€ speaker independent acoLstic template to generate at least one speaker independent matching score;
performing pattern matctung of the input speech segn^ent with a speaker dependent acoustic template tc generate at least one speaker dependent ntaSching score; and
combining the at least one speaker independent matching score with the at least one speaker dependent matching score to generate at least one combined matching score.
22. A method for oerforming voice recognition comprising;
comparing a set of input acoustic feature vectors wiih a speaker independent template in a speaker independent acoustic model to generate a speaker independent pattern matching score, wherein said speaker incependent template Is associated with a first utterance class;
con-jparing the set of Input acoustic feature vectors with at least one speaker dependent template in a speaker dependent acoustic model to generate a speaker dependent pattern matching score, wherein said speaker dependent template is associated with said first utterance class;
combining said speaker independent pattern matc:hing score with said speaker dependent pattern matching scores to produce a combined pattern matching score; and
comparing said combined pattern matching score with at least one other ccrnbined pattern matching score associated with a second utterance class.

?.3, An apparatus for performing voice recognition comprising:
means for performing pattern matching of a first input speech segment
with at least: one speaker independent acoustic template to produce at least one
input pattern matching score;
means for comparing the at least one input pattenn matching score wittj a
storod score associated with a stored acoustic template; and
meanj^ for replacing the stored acoustic template based on the results of
said compisring-
24. An apparatus for performing voice recognition comprising:
means for perfonning pattern matching of an input speech segment with
at teast one speai spociker independent matching score;
means for pe/forming pattern matching of the input speech segment with
a speaker dependent acoustic template to generate at least one speaker
dep means for combining the at least one speaker independent matching
score with the at least one speaker dependent matching score to generate at
least one combined matching score*

A voice recognition apparatus substantially as herein described with reference to the accompanying drawings. A method for performing voice recognition substantially as herein described with reference to the accompanying drawings.

Documents:

1539-chenp-2003 abstract duplicate.pdf

1539-chenp-2003 claims duplicate.pdf

1539-chenp-2003 correspondence-others.pdf

1539-chenp-2003 description (complete) duplicate.pdf

1539-chenp-2003 drawings duplicate.pdf

1539-chenp-2003-claims.pdf

1539-chenp-2003-correspondnece-others.pdf

1539-chenp-2003-correspondnece-po.pdf

1539-chenp-2003-description(complete).pdf

1539-chenp-2003-drawings.pdf

1539-chenp-2003-form 1.pdf

1539-chenp-2003-form 3.pdf

1539-chenp-2003-form 5.pdf

1539-chenp-2003-pct.pdf

« Previous Patent

Next Patent »

Patent Number

224885

Indian Patent Application Number

1539/CHENP/2003

PG Journal Number

49/2008

Publication Date

05-Dec-2008

Grant Date

24-Oct-2008

Date of Filing

29-Sep-2003

Name of Patentee

QUALCOMM INCORPORATED

Applicant Address

5775 MOREHOUSE DRIVE, SAN DIEGO, CALIFORNIA 92121-1714,

Inventors:

#	Inventor's Name	Inventor's Address
1	CHANG, CHIENCHUNG	6076 VIA POSADA DEL NORTE, RANCHO SANTA FE, CALIFORNIA 92067,
2	BI, NING	14209 BREEZEWAY PLACE, SAN DIEGO, CALIFORNIA 92128,
3	GARUDADRI, HARINATH	9435 OVIEDO STREET, SAN DIEGO, CALIFORNIA 92129,
4	MALAYATH, NARENDRANATH	10710 SABRE HILL DRIVE #229, SAN DIEGO, CALIFORNIA 92128,
5	DEJACO, ANDREW, P	9705 CAMINTO MOJADO, SAN DIEGO, CALIFORNIA 92131,
6	JALIL, SUHAIL	10380 MAYA LINDA ROAD, APT. C-110, SAN DIEGO, CALIFORNIA 92126,

PCT International Classification Number

G10L5/00

PCT International Application Number

PCT/US02/08727

PCT International Filing date

2002-03-22

PCT Conventions:

#	PCT Application Number	Date of Convention	Priority Country
1	09/821,606	2001-03-28	U.S.A.