Title of Invention

SYSTEM AND METHOD FOR TEXT-TO-SPEECH PROCESSING IN A PORTABLE DEVICE

Abstract A system and method for providing high-quality text-to-speech (TTS) output in a low-complexity device is disclosed. TTS output is generated by a TTS system that resides on a high-complexity device. The TTS output is transmitted from the high-complexity device to the low-complexity device for subsequent retrieval and playback.
Full Text

BACKGROUND
Field of the Invention
[0001] The present invention relates generally to text-to-speech processing and more particularly to text-to-speech processing in a portable device.
Introduction
[0002] Text-to-speech (TTS) synthesis technology gives machines the ability to convert arbitrary text into audible speech, with the goal of being able to provide textual information to people via voice messages. These voice messages can prove especially useful in applications where audible output is a key form of user feedback in system interaction. These situations arise when the user is unable to appreciate textual output as an effective means of responsive communication. In that regard, it is believed that TTS technology can provide promising benefits when used as a mechanism for communicating to users of handheld portable devices (e.g., cell phones, personal digital assistants, etc.).
[0003] Handheld portable device designs are typically driven by the ergonomics of use. For example, the goal of maximizing portability has typically resulted in small form factors with minimal power requirements. These constraints have clearly led to limitations in the availability of processing power and storage capacity as compared to general-purpose processing systems (e.g., personal computers) that are not similarly constrained.

[0004] Limitations in the processing power and storage capacity of handheld portable designs have a direct impact on the ability to provide acceptable TTS output. Currently, these limitations have dictated that only low-quality TTS technology could be used. What is needed therefore is a solution that enables an application of high-quality TTS technology in a manner that accommodates the limitations of current handheld portable devices.
BRIEF DESCRIPTION OF THE DRAWINGS
[0005] In order to describe the manner in which the above-recited and other advantages
and features of the invention can be obtained, a more particular description of the invention
briefly described above will be rendered by reference to specific embodiments thereof which
are illustrated in the appended drawings. Understanding that these drawings depict only
typical embodiments of the invention and are not therefore to be considered to be limiting of
its scope, the invention will be described and explained with additional specificity and detail
through the use of the accompanying drawings in which:
[0006] FIG. 1 illustrates an embodiment of a text-to-speech processing environment in
accordance with the present invention; and
[0007] FIG. 2 illustrates an embodiment of a text-to-speech component in a mobile
computing device.

DETAILED DESCRIPTION
[0008] Various embodiments of the invention are discussed in detail below. While specific implementations are discussed, it should be understood that this is done for illustration purposes only. A person skilled in the relevant art will recognize that other components and configurations may be used without parting from the spirit and scope of the invention.
[0009] Text-to-speech (TTS) synthesis technology enables electronic devices to convert a
stream of text into audible speech. This audible speech thereby provides users with textual
information via voice messages. TTS can be applied in various contexts such as email or any
other general textual messaging solution.
[0010] As would be appreciated, the quality of TTS synthesized speech is of critical
importance in the increasingly widespread application of the technology. Mobile devices
such as phones and personal digital assistants are particularly suitable for leveraging TTS
technology.
[0011] Several different TTS methods for synthesizing speech exist, including
articulatory synthesis, formant synthesis, and concatenative synthesis methods.
[0012] Articulatory synthesis uses computational biomechanical models of speech
production, such as models for the glottis (that generates the periodic and aspiration
excitation) and the moving vocal tract. Ideally, an articulatory synthesizer would be
controlled by simulated muscle actions of the articulators, such as the tongue, the lips, and
the glottis. It would solve time-dependent, three-dimensional differential equations to compute the synthetic speech output. Unfortunately, besides having notoriously high computational requirements, articulatory synthesis also, at present, does not result in natural-sounding fluent speech.
[0013] Formant synthesis uses a set of rules for controlling a highly simplified source-filter model that assumes that the (glottal) source is completely independent from the filter (the vocal tract). The filter is determined by control parameters such as formant frequencies and bandwidths. Each formant is associated with a particular resonance (a "peak" in the filter characteristic) of the vocal tract. The source generates either stylized glottal or other pulses (for periodic sounds) or noise (for aspiration or frication). Formant synthesis generates highly intelligible, but not completely natural sounding speech. However, it has the advantage of a low memory footprint and only moderate computational requirements.
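As a minimal sketch of the source-filter idea in paragraph [0013], the following Python code passes a stylized pulse train (the source) through a single two-pole resonator (one formant). The resonator form, the sample rate, the formant frequency and bandwidth values, and all function names are illustrative assumptions and are not taken from this specification.

```python
import math


def resonator_coefficients(freq_hz, bandwidth_hz, sample_rate):
    """Coefficients (a, b, c) for y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    b = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    c = -r * r
    a = 1.0 - b - c  # normalizes the filter gain to 1 at 0 Hz
    return a, b, c


def pulse_train(pitch_hz, duration_s, sample_rate):
    """Stylized periodic source: one unit pulse per pitch period."""
    period = int(round(sample_rate / pitch_hz))
    count = int(round(duration_s * sample_rate))
    return [1.0 if i % period == 0 else 0.0 for i in range(count)]


def apply_formant(source, freq_hz, bandwidth_hz, sample_rate):
    """Filter the source through one resonance (one formant peak)."""
    a, b, c = resonator_coefficients(freq_hz, bandwidth_hz, sample_rate)
    y1 = y2 = 0.0  # the two previous output samples
    out = []
    for x in source:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


# Assumed example values: 8 kHz sampling, 100 Hz pitch, 500 Hz formant.
SAMPLE_RATE = 8000
source = pulse_train(100, 0.02, SAMPLE_RATE)
speech_like = apply_formant(source, 500, 80, SAMPLE_RATE)
```

Because the source and the filter are computed independently, changing the pitch does not change the formant, which is the simplification the paragraph describes. The filter state is only two numbers per formant, which is consistent with the low memory footprint noted above.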
[0014] Finally, concatenative synthesis uses actual snippets of recorded speech that were cut from recordings and stored in an inventory ("voice database"), either as "waveforms" (uncoded) or encoded by a suitable speech coding method. Elementary "units" (i.e., speech segments) are, for example, phones (a vowel or a consonant) or phone-to-phone transitions ("diphones") that encompass the second half of one phone plus the first half of the next phone (e.g., a vowel-to-consonant transition). Some concatenative synthesizers use so-called demi-syllables (i.e., half-syllables; syllable-to-syllable transitions), in effect, applying the "diphone" method to the time scale of syllables. Concatenative synthesis itself then strings together (concatenates) units selected from the voice database and, after optional decoding, outputs the resulting speech signal. Because concatenative systems use snippets of recorded speech, they have the highest potential for sounding "natural".
[0015] Conventional applications of TTS technology to low complexity devices (e.g., mobile phones) have been forced to trade off quality of the TTS synthesized speech in environments that are limited in their processing and storage capabilities. More specifically, mobile devices are typically designed with much lower processing and storage capabilities as compared to conventional desktop or laptop personal computing devices. This results in the inclusion of low-quality TTS technology in mobile devices. For example, conventional applications of TTS technology to mobile devices have used formant synthesis technology, which has low memory footprint and only moderate computational requirements.
[0016] In accordance with the present invention, high-quality TTS technology is enabled even when applied to devices (e.g., mobile devices) that have limited processing and storage capabilities. Principles of the present invention will be described with reference to FIG. 1, which illustrates the application of high-quality TTS technology to a mobile phone. In the following description, the high-quality TTS technology is exemplified by concatenative synthesis technology. It should be noted, however, that the principles of the present invention are not limited to concatenative synthesis technology. Rather, the principles of the present invention are intended to apply to any context wherein the TTS technology is of a complexity that cannot practically be applied to a given device.
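The concatenation step described in paragraph [0014] can be sketched as follows. This is a toy illustration only: the voice database layout, the diphone naming scheme, and the function names are assumptions, and the sketch omits unit selection, decoding, and any smoothing at unit boundaries.

```python
def to_diphones(phones):
    """Pair each phone with its successor, e.g. k, ae, t -> k-ae, ae-t."""
    return [f"{left}-{right}" for left, right in zip(phones, phones[1:])]


def concatenate(phones, voice_db):
    """String together the stored waveform snippet for each diphone."""
    samples = []
    for unit_name in to_diphones(phones):
        samples.extend(voice_db[unit_name])  # look up the recorded snippet
    return samples


# Hypothetical voice database: diphone name -> recorded samples.
toy_voice_db = {
    "k-ae": [0.1, 0.3, 0.2],
    "ae-t": [0.2, -0.1],
}
signal = concatenate(["k", "ae", "t"], toy_voice_db)
```

Even in this reduced form, the storage cost is visible: the database must hold a recorded snippet for every unit that can occur, which is why the approach is storage intensive compared with formant synthesis.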
[0017] In one example mobile phone application, TTS technology can be used to assist voice dialing. In general, voice dialing is highly desirable whenever users are functionally blind or handicapped, such as is the case when a user is driving a car. Saying "Call John at work" is certainly safer than attempting to dial a 10-digit string of numbers into a miniature dial pad while driving.
[0018] Voice dialing and comparable command and control are made possible by automatic speech recognition (ASR) technology that is available in low-footprint ASR engines. The low memory footprint allows ASR to run on the device itself.
[0019] While voice dialing can increase personal safety, the voice dialing process is not entirely free from distraction. In some conventional phones, voice dialers provide feedback (e.g., "Do you mean John Doe or John Miller?") via text messages or low-quality TTS.
[0020] For high quality (natural-sounding, intelligible) rendering of feedback messages via synthetic speech, the latest TTS technology is needed. Ideally, the TTS module would also run on the device and provide the feedback to the user to ensure that the ASR engine correctly interpreted the voice input. As noted, however, current high-quality TTS requires a greater level of processing and memory support than is available on many current devices. Indeed, it will likely be the case that the most current TTS technology will almost always require a higher level of processing and memory support than is available in many devices.
[0021] As will be described in greater detail below, the present invention enables high-quality TTS to be used even in devices that have modest processing and storage capabilities. This feature is enabled through the leveraging of the processing power of additional devices (e.g., desktop and laptop computers) that do possess sufficient levels of processing and storage capabilities. Here, the leveraging process is enabled through the communication between a high-capability device and a low-capability device.

[0022] FIG. 1 illustrates an embodiment of such an arrangement. As illustrated in FIG. 1, TTS environment 100 includes computer 110, mobile phone 120, and user 130. Here, computer 110 and mobile phone 120 can be designed to communicate as part of a synchronization process. This synchronization process allows user 130 to ensure that a database of information (e.g., calendar, contacts/phonebook, etc.) on computer 110 is in sync with the database of information on mobile phone 120. As would be appreciated, modifications to the general database of information (e.g., generating a new contact, modifying existing contact information, etc.) can be made either through the user's interaction with computer 110 or with mobile phone 120.
[0023] It should be noted that the synchronization of information between computer 110 and mobile phone 120 can be implemented in various ways. In various embodiments, wired connections (e.g., USB connection) or wireless connections (e.g., Bluetooth) can be used. Various synchronization software can also be used to effect the synchronization process. Current examples of available synchronization software include HotSync by Palm, Inc. and iSync by Apple Computer, Inc. As would be appreciated, the principles of the present invention are not dependent upon the particular choice of connection between computer 110 and mobile phone 120 or the particular synchronization software that coordinates the exchange.
[0024] In general, the synchronization process provides a structured manner by which high-quality TTS information can be provided to mobile phone 120. In an alternative embodiment, a dedicated software application can be designed apart from a third-party synchronization software package to accomplish the intended purpose. With this communication conduit, the TTS system in mobile phone 120 can leverage the processing and storage capabilities within computer 110. More specifically, in the context of a concatenative synthesis technique, the processing and storage intensive portions of the TTS technology would reside on computer 110. An embodiment of this structure is illustrated in FIG. 2.
[0025] As illustrated in FIG. 2, computer 110 includes TTS system 210. TTS system 210 is a concatenative synthesis system that includes text analysis module 212 and speech synthesis module 214. Text analysis module 212 itself can include a series of modules with separate and intertwined functions. In one embodiment, text analysis module 212 analyzes input text and converts it to a series of phonetic symbols and prosody (fundamental frequency, duration, and amplitude) targets. While the specific output provided to speech synthesis module 214 can be implementation-dependent, the primary function of speech synthesis module 214 is to generate speech output. This speech output is stored in speech output database 220.
[0026] The TTS output that is stored in speech output database 220 represents the result of TTS processing that is performed entirely on computer 110. The processing and storage capabilities of mobile phone 120 have thus far not been required.
[0027] In one embodiment, TTS system 210 can be used to generate presynthesized speech output for both carrier phrases and slot information. An example of a carrier phrase is "Do you want me to call [slot1] at [slot2]?" Here, slot1 (e.g., name) and slot2 (e.g., location such as "at work") represent audio fillers for the carrier phrases. The presynthesized carrier phrases and slot information are downloaded to mobile phone 120 for subsequent playback.
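One way the computer-side preparation could separate a carrier phrase from its slots before synthesis is sketched below. The bracketed slot notation follows the example carrier phrase in the text; the parsing approach, the regular expression, and the function name are assumptions made for illustration.

```python
import re

# Matches slot markers such as [slot1] or [slot2] (assumed notation).
SLOT_PATTERN = re.compile(r"\[(slot\d+)\]")


def split_carrier(template):
    """Split a carrier phrase into ('text', words) and ('slot', name) parts."""
    parts = []
    position = 0
    for match in SLOT_PATTERN.finditer(template):
        text = template[position:match.start()].strip()
        if text:
            parts.append(("text", text))
        parts.append(("slot", match.group(1)))
        position = match.end()
    tail = template[position:].strip()
    if tail:
        parts.append(("text", tail))
    return parts


parts = split_carrier("Do you want me to call [slot1] at [slot2]?")
# parts == [("text", "Do you want me to call"), ("slot", "slot1"),
#           ("text", "at"), ("slot", "slot2"), ("text", "?")]
```

Under this sketch, each "text" part would be synthesized once on the computer, while each "slot" part marks a position to be filled later with separately presynthesized audio such as a contact name.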

[0028] In an embodiment that leverages synchronization software, the slot information can be downloaded to mobile phone 120 as another data type of a general database that is updated during the synchronization process. For example, slot information dedicated for names can be included as a separate data type for each contact record in a user's address/phone book.
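The idea of carrying slot audio as one more field of a contact record can be sketched as follows. The record layout, the revision counter, and the one-way update rule are all assumptions for illustration; real synchronization software may reconcile changes in both directions, which this sketch does not attempt.

```python
from dataclasses import dataclass


@dataclass
class ContactRecord:
    name: str
    number: str
    name_audio: bytes  # presynthesized slot audio for the name (assumed field)
    revision: int      # assumed change counter used to detect updates


def synchronize(computer_db, phone_db):
    """Copy records that are new or newer on the computer to the phone."""
    updated = []
    for key, record in computer_db.items():
        existing = phone_db.get(key)
        if existing is None or existing.revision < record.revision:
            phone_db[key] = record  # slot audio travels with the record
            updated.append(key)
    return updated


computer_db = {
    "c1": ContactRecord("John Doe", "555-0100", b"<audio:john-doe>", 2),
    "c2": ContactRecord("John Miller", "555-0101", b"<audio:john-miller>", 1),
}
phone_db = {
    "c1": ContactRecord("John Doe", "555-0100", b"<audio:old>", 1),
}
changed = synchronize(computer_db, phone_db)
```

Treating the audio as an ordinary field means that no separate transfer mechanism is needed: whenever a contact is added or changed, its presynthesized name is refreshed in the same pass.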
[0029] The provision of carrier phrases and slot information to mobile phone 120 enables the implementation of a simple TTS component on mobile phone 120. This TTS component can be designed to implement a general table management function as it matches particular carrier phrases with the appropriate slot information. A small code footprint therefore results.
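A minimal sketch of such a table management function is shown below. The table layouts, the keys, and the function name are hypothetical; the audio clips are represented by placeholder byte strings rather than real speech data.

```python
def assemble_prompt(carrier_id, slot_values, carrier_table, slot_table):
    """Return the ordered list of audio clips that make up one prompt."""
    clips = []
    for kind, value in carrier_table[carrier_id]:
        if kind == "audio":
            clips.append(value)  # presynthesized carrier segment
        else:
            # kind == "slot": look up presynthesized audio for the slot value
            clips.append(slot_table[(value, slot_values[value])])
    return clips


# Hypothetical tables as they might exist on the phone after download.
carrier_table = {
    "confirm_call": [
        ("audio", b"<do you want me to call>"),
        ("slot", "name"),
        ("audio", b"<at>"),
        ("slot", "location"),
    ],
}
slot_table = {
    ("name", "John"): b"<john>",
    ("location", "work"): b"<work>",
}
prompt = assemble_prompt(
    "confirm_call", {"name": "John", "location": "work"},
    carrier_table, slot_table,
)
```

The phone performs only lookups and list assembly here, with no synthesis of its own, which is consistent with the small code footprint described above.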
[0030] In one embodiment, the presynthesized carrier phrases and slot information are downloaded in coded (compressed) form. While the transmission of compressed information to mobile phone 120 will certainly increase the speed of transfer, it also enables further simplicity in the implementation of the TTS component on mobile phone 120. More specifically, in one embodiment, the TTS component on mobile phone 120 is designed to leverage the speech coder/decoder (codec) that already exists on mobile phone 120. By presynthesizing and storing the speech output in the appropriate coded format, mobile phone 120 can then be designed to pass that information through the existing speech codec of mobile phone 120, thereby effectively providing TTS playback by "faking" the playback of a received phone call. This embodiment serves to significantly reduce implementation complexity.
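The following sketch illustrates the encode-on-the-computer, decode-on-the-phone split using mu-law companding as a stand-in speech codec. This choice is an assumption made only for illustration: the specification does not name a codec, an actual phone would use whatever codec it already contains, and the continuous-formula version below is not a bit-exact implementation of any telephony standard.

```python
import math

MU = 255.0  # companding constant used by mu-law


def mulaw_encode(sample):
    """Compress one sample in [-1.0, 1.0] to an 8-bit code (0..255)."""
    sample = max(-1.0, min(1.0, sample))
    magnitude = math.log1p(MU * abs(sample)) / math.log1p(MU)
    signed = math.copysign(magnitude, sample)
    return int(round((signed + 1.0) / 2.0 * 255.0))


def mulaw_decode(code):
    """Expand an 8-bit code back to an approximate sample value."""
    signed = code / 255.0 * 2.0 - 1.0
    magnitude = math.expm1(abs(signed) * math.log1p(MU)) / MU
    return math.copysign(magnitude, signed)


def encode_prompt(samples):
    """Computer side: store presynthesized speech in coded form."""
    return bytes(mulaw_encode(s) for s in samples)


def decode_prompt(coded):
    """Phone side: pass stored codes through the decoder for playback."""
    return [mulaw_decode(code) for code in coded]


coded = encode_prompt([0.0, 0.25, -0.5, 1.0])
restored = decode_prompt(coded)
```

In this sketch each sample occupies one byte, half the size of 16-bit PCM, and the phone-side work is limited to the decode step that its existing codec would already provide.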

[0031] While the invention has been described in detail and with reference to specific embodiments thereof, it will be apparent to one skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope thereof. Thus, it is intended that the present invention covers the modifications and variations of this invention provided they come within the scope of the appended claims and their equivalents.


WE CLAIM:
1. A method in a speech processing system that includes a speech synthesis module
that enables a conversion of text into synthesized speech, said method comprising the
steps of:
(1) receiving presynthesized slot information as part of a synchronization process with a computing device, wherein said slot information represents a value of a defined data type in a user record on said computing device, said slot information being designed for inclusion at a predefined position within a carrier phrase;
(2) storing said presynthesized slot information in a memory; and
(3) reproducing said carrier phrase and said presynthesized slot information as audible output for a user.

2. The method of claim 1, comprising receiving a presynthesized carrier phrase.
3. The method of claim 1, wherein the speech processing system is a personal computer.
4. The method of claim 1, wherein said receiving comprises receiving via a wired link.
5. The method of claim 1, wherein said receiving comprises receiving via a wireless link.
6. The method of claim 1, wherein said carrier phrase and said presynthesized slot information are compressed and wherein said reproducing comprises passing said carrier phrase and said presynthesized slot information through a codec.

7. The method of claim 1, wherein said receiving comprises receiving the
presynthesized slot information from a personal digital assistant.
8. The method of claim 1, wherein said receiving comprises receiving the
presynthesized slot information as part of a synchronization process.
9. The method of claim 1, wherein said receiving comprises receiving presynthesized
carrier segments and presynthesized slot segments.
10. The method of claim 1, wherein said slot information is one of a name, a number,
and location.
11. A computing device configured to perform the method claimed in any one of the
preceding claims.




Patent Number 232225
Indian Patent Application Number 3040/CHENP/2005
PG Journal Number 13/2009
Publication Date 27-Mar-2009
Grant Date 16-Mar-2009
Date of Filing 17-Nov-2005
Name of Patentee AT & T CORP
Applicant Address 32 Avenue of the Americas, New York, NY 10013-2412,
Inventors:
# Inventor's Name Inventor's Address
1 SCHROETER, Horst, Juergen 58 Commonwealth Ave., New Providence, NJ 07974,
PCT International Classification Number G10L 01/00
PCT International Application Number PCT/US04/11654
PCT International Filing date 2004-04-15
PCT Conventions:
# PCT Application Number Date of Convention Priority Country
1 60/463,760 2003-04-18 U.S.A.
2 10/742,853 2003-12-23 U.S.A.