Title of Invention

METHOD FOR GENERATING REPROCESSABLE SIGNALS FOR SIMILAR SOUNDING WORDS IN LANGUAGES SPOKEN IN INDIA

Abstract

A method for performing a phoneme-based character transformation of textual information based on locale-specific information, comprising: i) providing a phonetic transformation structure comprising multidimensional arrays storing language-specific phonetic structures mapped to a basic Latin phonetic structure in a microprocessor-based processing unit; ii) receiving input initiating a phonic character transformation to be performed on an input character and/or character sequence, said input character/character sequence comprising Latin/non-Latin character(s) having at least one byte or 7-bit value; iii) mapping said input character/character sequence into the multidimensional array storing the language-specific information; iv) transforming the said Latin phonetic character/character sequence into phonic transformation values for said input character/character sequence for use by known methods.
Full Text
FORM 2
THE PATENTS ACT : 1970
(39 OF 1970)
COMPLETE SPECIFICATION SECTION: 10
METHOD FOR GENERATING REPROCESSABLE SIGNALS FOR SIMILAR SOUNDING WORDS IN LANGUAGES SPOKEN IN INDIA
CENTRE FOR DEVELOPMENT OF ADVANCED COMPUTING
(CDAC )
A Scientific Society of the Department of Information Technology under the Ministry of Communications and Information Technology, registered under the provisions of the Societies Registration Act, having its Head Office at
Pune University Campus, Ganeshkhind Road, Pune 411 007, Maharashtra State
The following specifications particularly describe the nature of this invention and the manner in which it is to be performed.


FIELD OF THE INVENTION
The present system aims at developing an effective method and system for processing inputs from a variety of languages for further use.
BACKGROUND OF INVENTION
For the information technology revolution to reach the masses, software manufacturers and technology developers need to address the local and culture-specific information of the users. Traditionally, however, software products have been written, developed and deployed only in Latin scripts, especially English. As the penetration of information technology increases, the need to address a wider range of users has become inevitable. Multilingual content creation, application development and deployment has become one of the major areas of technology development.
The number of languages and dialects spoken in India is about 6,473. The scripts of the various languages also differ. As such, data processing in Indian languages presents an uphill task. Several difficulties are encountered in translation, transliteration and the like.
Huge databases of information exist in various Indian languages and cannot be ignored in the present age of information. However, the present systems for utilising databases in Indian languages are not very satisfactory.


In the past, the process of providing National Language Support (i.e., accommodating a specific country's language, conventions, and culture) was done on a more or less ad hoc basis. The process essentially involved retrofitting the technology to accommodate a particular locale. Merely separating the text in a user interface of the program is not an acceptable solution. Even after translating messages, help messages, and other textual information into the target language, one still has to address the basic issues of displaying and printing characters in the target language.
DRAWBACKS OF THE PRESENT TECHNOLOGY
This becomes more complex in the special context of Indian language support. For instance, Indian language codes were not included in the default character sets provided by computer operating systems. ISCII is the Indian language character set for storing, searching, sorting, indexing and editing Indian languages on computer systems.
Another problem is that, apart from the character set encoding, various users use different encoding mechanisms, such as glyph-based encoding, UNICODE encoding, double-font encoding, etc. Various non-standard and proprietary encoding mechanisms also exist in parallel. The "coded" information comprises the set of numeric codes employed to represent that information set. The actual numeric value used to represent a particular character may, in fact, vary from one encoding mechanism to another.
Imagine a situation where the electoral rolls of a country like India are to be put in the form of a database. This database will have all the information of each voter in the country, entered in the language of his locality or constituency. In a country like India, where there are 10 (ten) official languages and more than sixteen hundred variants of languages, it becomes practically impossible to search, sort or index the data as the number of languages increases. Further, it is very difficult to search data if the input is given in one language and the results are to be found in data stored in another language.
To date, the efforts made to address this sort of problem were mostly specific to a particular situation. Such an approach is in itself problematic. Further, such an approach is not inter-portable or interpretable across various heterogeneous systems. Even if a system attempts to maintain multiple sets of locale-specific information for searching and sorting, the approach is inefficient. It is an inefficient use of resources (e.g. system memory, storage space, processing power, etc.) to maintain language-specific information for each individual language.
The Soundex system is a method for coding words so that words that sound alike have the same code. According to Donald Knuth in The Art of Computer Programming, Vol. 3: Sorting and Searching, the Soundex method was originally developed by Margaret Odell and Robert Russell and was patented (U.S. Pat. Nos. 1,261,167 (1918) and 1,435,663 (1922)). The said patents disclose a method for assigning values to various letters, which depend on the type of Latin character and which can be used to search a database. But this system has drawbacks, as it works only for English.
The present invention not only ameliorates these drawbacks but also provides better hardware-amalgamated support to generate faster reprocessable phonetic weights, which can be used for various advanced processing by other hardware devices.
What is needed is a system providing international language support in application programs which is portable and yet flexible. Such a solution should be suited for multiple platforms, yet be able to search, sort and index language-specific and locale-specific information. Such a system should be able to handle multiple languages spoken in India simultaneously.
The present invention fulfils this and other needs.
GLOSSARY
ASCII: American Standard Code for Information Interchange, a 7-bit code.
ISCII: Indian Standard Code for Information Interchange; a sequence of 128 standard characters fitted in 8-bit space.
Code page: A character set, such as available in MS-DOS versions 3.3 and later, that provides a table for relating the binary character codes used by a program to keys on a keyboard or to the appearance of characters on a display.

Database: An organized collection of information.
Enabling or Internationalization: Designing and coding a product so that it can be made to function for international use. A product is enabled if a national language version can be created at minimal expense and if it does not interfere with current or planned national language support of other products.
Index: A file that determines the order in which the system can access the records in a table.
National Language: A language or dialect spoken by any group of people in a country.
Unicode: A particular 16-bit character set, as defined by the Unicode Consortium. The term "Unicode," when used generally herein, refers to an encoded representation of a character in the Unicode character set; the encoding is fixed at two bytes in length, with a variable-width encoding known as "UTF-8" (8-bit Unicode Transformation Format) available, which may vary from one to three bytes in length. Different formats are available. One standard, ISO 10646, defines an international standard representation of Unicode.
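By way of illustration only, the fixed-width and variable-width encodings mentioned in the Unicode entry can be compared with a short Python sketch. The function name encoded_lengths is hypothetical and is not part of the specification; the sketch merely reports how many bytes one character occupies in UTF-8 and in a two-byte (UTF-16, big-endian) form.

```python
def encoded_lengths(ch):
    """Return (UTF-8 byte count, UTF-16 big-endian byte count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


# Latin capital A, Latin e with acute accent, Devanagari letter KA.
for ch in ("A", "\u00e9", "\u0915"):
    utf8_len, utf16_len = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: UTF-8 = {utf8_len} byte(s), UTF-16 = {utf16_len} byte(s)")
```

The same character therefore has a different stored length, and a different numeric representation, depending on the encoding mechanism chosen.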

SUMMARY OF THE INVENTION
This invention relates to the generation of reusable phonetic weights that can be used to search, sort and index data across various Indian languages with greater accuracy and efficiency. This invention deals with a system that comprises a standard data processing unit, comprising a central processing unit, keyboard and mouse, and a processing unit consisting of a microprocessor and/or a microcontroller coupled with a unit incorporating a unique microcode. The Microprocessor-based Processing Unit (MPU) referred to above identifies inputs received through any standard data entry device, such as a keyboard, or through electronically generated or received data.
The processing unit acquires the natural language data, encoded in binary format, and processes it through a set of steps to generate a set of language-independent digital values, or phonetic weights. These weights, which are in the form "letter, digit, digit, digit", can be used to search, sort and index the data already stored in the processing unit's primary or secondary storage devices, or on a network storage device or devices.
The invention will now be described in detail with the help of the drawings accompanying this specification, wherein the details are indicated.


Brief Description of the Diagrams:
Figure 1 shows the general schematic block diagram of the Microprocessor-Based System ('MBS').
(1) in Fig. 1 shows the various inputs to the 'MBS' (2). The inputs can come from various devices, namely a keyboard, a speech recognition device, an image input and/or a transliteration unit capable of transliterating digital signals from one language to another.
(2) in Fig. 1 is a representation of the Microprocessor-Based System with the Index Generator Unit ('IGU').
(3) in Fig. 1 shows the re-processable signals generated by the MBS.
Figure 2 is a schematic diagram showing the MPU (4) interacting/communicating with the IGU (5).
Figure 3(A) shows the construction of the first module of the Index Generator Unit ('IGU').
(6) in Figure 3(A) shows the System Bus.
(7) in Figure 3(A) shows the logical AND gates of the Index Generator Unit.
(8) in Figure 3(A) shows the pathways for the output of the IGU.
Figure 3(B) shows the second module of the Index Generator Unit, using very large scale integration (VLSI) for the implementation of the IGU.
(8) in Figure 3(B) shows the Address Bus.

DESCRIPTION OF INVENTION
The following description will focus on the presently preferred embodiment of the present invention, which is operative in a searching, sorting and indexing environment executing multilingual applications.
The present invention, however, is not limited to any particular application or environment. Instead, those skilled in the art will find that the present invention may be advantageously applied to any application or environment where optimization of performance is desirable, such as database management systems, internet tools and the like. The description of the exemplary embodiments which follows is, therefore, for the purpose of illustration and not limitation.
The input (1) is obtained from user-specific peripherals. These peripherals can be as simple as a keyboard giving input to the MBS (2), or can be a device such as a speech recognition unit or an optical character reader capable of inputting an image.
In another variation, a transliterator capable of transliterating characters even from non-Indian languages can be used for creating reprocessable signals for user-specific requirements.

The multilingual data which is forwarded to the MBS (2), and which the MBS (2) takes as input, may follow a proprietary or other coding standard. Various coding mechanisms exist, such as glyph/font-based coding, UNICODE encoding, ISO encoding, ISCII, etc. Data encoded using any encoding mechanism will first be converted to a common character encoding standard (such as ISCII, UNICODE, etc.) using a set of language-specific rules incorporated in the circuit of the current embodiment in Fig. 1.
These rules are encoded using logic gate arrays or encoded on a silicon chip, and are specific to converting the data from a particular type of encoding mechanism to ISCII. This conversion mechanism will take care of the language or languages, their attributes, punctuation marks, diacritic marks and other lexical information relevant to the language or languages embedded within the input.
The inputs obtained are thus converted into ISCII outputs in (1). The ISCII data used in the MBS (2) will adhere to the ISCII 91 standard. The ISCII 91 standard is specified for the present embodiment and is not a limitation on the invention. The Microprocessor-Based System (2) can be extended to all forms of ISCII and its variants. The present manifestation is stand-alone in nature, though it is extendable to multi-platform and/or multi-processor and/or multiple-system and/or network environments.

MICROPROCESSOR BASED SYSTEM (2)
The Microprocessor-Based System (2) consists of two components.
The first component is the MPU, or Microprocessor Unit. The MPU (4) interacts or communicates with the Index Generator Unit (5) through the bus.
The input received by the MBS is utilised in the MPU, which interacts through a bus with the Index Generator Unit, or IGU (5). For the sake of the current embodiment, the ISCII 91 character standard is considered as the input to the MPU, though this is not a limitation and is extendable to other encoding mechanisms.
The data encoded in ISCII 91, which the system takes as input from (1) in this case, will be passed to the IGU (5). The data is converted to a common ASCII character set in the IGU (5). This is achieved using a set of language-specific logic incorporated in the circuit of the current embodiment. The circuit employs logic gate arrays (3A) encoded on a silicon chip. The said silicon chip will then pass this data over to the next module (3B) in the IGU (5) over the IGU/system bus. For example, "क" will be mapped to Latin "ka", and so on. In the case of nasal sounds/vowel modifiers, which are a specific peculiarity of Indian characters, the character following the nasal sound governs the mapping in the multidimensional array: the nasal modifier is mapped to the last nasal sound of the consonant group to which the following character belongs. This will facilitate mapping of phonemes across the various languages spoken in India.
This mechanism will also overcome the drawbacks due to nasals, and will allow smooth transition across languages and facilitate phonetic compatibility with the existing Latin system.
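The nasal rule described above can be sketched in software as follows. This is a minimal Python illustration under stated assumptions and not the logic-gate embodiment: it operates on Unicode Devanagari text rather than ISCII 91, it covers only the five classical consonant groups, and the names VARGA_NASAL and resolve_nasal are hypothetical.

```python
ANUSVARA = "\u0902"  # Devanagari nasal modifier written above the preceding letter
VIRAMA = "\u094d"    # Devanagari sign that suppresses the inherent vowel

# Assumed table: each consonant group mapped to the nasal that closes the group.
VARGA_NASAL = {
    "कखगघ": "ङ",
    "चछजझ": "ञ",
    "टठडढ": "ण",
    "तथदध": "न",
    "पफबभ": "म",
}


def resolve_nasal(text):
    """Replace a nasal modifier by the nasal of the group of the following consonant."""
    out = []
    for i, ch in enumerate(text):
        nasal = None
        if ch == ANUSVARA and i + 1 < len(text):
            following = text[i + 1]
            # The character following the nasal sound governs the mapping.
            nasal = next(
                (n for group, n in VARGA_NASAL.items() if following in group),
                None,
            )
        # Modifiers with no matching group are passed through unchanged.
        out.append(nasal + VIRAMA if nasal else ch)
    return "".join(out)


print(resolve_nasal("गंगा"))
```

Under these assumptions, a word written with the nasal modifier and the same word written with the explicit group nasal reduce to one character sequence before the Latin mapping is applied.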
The output sequence of Latin characters generated as above by module (3A) will be used as the input for module (3B) of the IGU (5). These character sequences will be analyzed till the end, and the phonetic weights corresponding to them will be generated in the "letter, digit, digit, digit" format.
The digital signals received from the above module (3A) of the IGU (5) are processed using the following steps in module (3B) of the IGU.
1. The apparatus stores the digital value corresponding to the first bit packet (depending on the underlying character encoding, either an English equivalent or an Indian language equivalent can be stored) in the device RAM, as this value will be used as input in a later stage.
2. The digital signals that are received are classified using English alphabets for simplicity of representation.
3. Further processing is by known methods, as follows.
4. The apparatus drops all occurrences of signals which correspond to the English characters "A", "E", "I", "O", "U", "W" and "Y" in other positions.
5. If the apparatus finds repetitions in the digital signals, only one signal is taken as input and the remaining repeating patterns are omitted.
6. For the digital signals that occur in the second packet of bits, the apparatus assigns different values. These values are determined by the arrangement of windowed logic gates. To illustrate this:
a. If the bit packet matches one of the bit patterns of B/F/P/V, the value of 1 is stored in the RAM location allocated for the output buffer.
b. If the bit packet matches one of the bit patterns of C/G/J/K/Q/S/X/Z, the value of 2 is stored in the RAM location allocated for the output buffer.
c. If the bit packet matches one of the bit patterns of D or T, the value of 3 is stored in the RAM location allocated for the output buffer.
d. If the bit packet matches the bit pattern of L, the value of 4 is stored in the RAM location allocated for the output buffer.
e. If the bit packet matches one of the bit patterns of M or N, the value of 5 is stored in the RAM location allocated for the output buffer.
f. If the bit packet matches the bit pattern of R, the value of 6 is stored in the RAM location allocated for the output buffer.
7. The apparatus performs the process in (4) from the 2nd to the 4th bit patterns.
8. After performing the above operation, the apparatus collates the data stored in RAM from the locations of the output buffer and compiles the resultant phonetic weight in a 4-byte format, comprising the data stored in RAM in step (1) at the first location and the balance bits acquired from step (5) placed after it in the order of occurrence.
9. Any digital signals which do not match the signals stated in (4) above are ignored by the apparatus.
10. The resultant of this process is a 4-byte code, which is represented in the "Letter, digit, digit, digit" format.
11. In the present embodiment, the apparatus is a DSP processor designed on a very large scale integration (VLSI) chip.
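The steps above can be sketched in software as follows. This is a minimal Python illustration of the listed steps and not the VLSI embodiment. The function name phonetic_weight is hypothetical, and two points are assumptions not stated in the specification: repetitions are collapsed among the retained letters before values are assigned, and codes shorter than four characters are padded with "0" as in the classic Soundex method. The result may therefore differ from classic Soundex for some words.

```python
# Value table from step 6 (a) to (f).
GROUPS = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
CODE = {letter: digit for letters, digit in GROUPS.items() for letter in letters}


def phonetic_weight(latin_word):
    """Return a 'letter, digit, digit, digit' weight for a Latin character sequence."""
    letters = [c for c in latin_word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]  # step 1: keep the first character as it is

    # Steps 4 and 9: drop A, E, I, O, U, W, Y and any other letter without a value.
    kept = [c for c in letters[1:] if c in CODE]

    # Step 5: keep only one signal from each run of repeated signals (assumed order).
    collapsed = [c for i, c in enumerate(kept) if i == 0 or c != kept[i - 1]]

    # Steps 6 and 7: assign values for the 2nd to 4th positions.
    digits = "".join(CODE[c] for c in collapsed)[:3]

    # Steps 8 and 10: first character followed by three digits.
    # Padding with "0" is an assumption borrowed from classic Soundex.
    return first + digits.ljust(3, "0")


for word in ("Robert", "Rupert", "ganga", "gangaa"):
    print(word, phonetic_weight(word))
```

In this sketch, "Robert" and "Rupert" receive the same weight, as do the two transliterations "ganga" and "gangaa", which is the behaviour the phonetic weights are meant to provide.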
ADVANTAGES OF CURRENT EMBODIMENT
1. Since the complete embodiment is embedded on the silicon chip, it will not depend on other system peripherals, which affect the performance of the processing.
2. This invention will be platform independent and operating system independent. This embodiment can be plugged into any network working with any database.
3. This invention will be microprocessor independent, as the IGU acts as a standalone, separate unit.
REPROCESSABILITY OF MPU OUTPUT
The phonetic weights generated as described hereinabove are capable of utilisation for various purposes: they can be used to achieve search results, to perform indexing of the data or to perform sorting of data.
a. The phonetic weights generated by the process as described can be used by known methods as a basis for searching, sorting and indexing data in various applications, stand-alone or network based. This will increase the speed of operation by narrowing down the scope of the search to a select few relevant queries. This could be useful in situations of dual representations of words (the same word spelled in more than one way), and in places where there are spelling mistakes but nearly the same pronunciation.
b. These values can also be used to find similar sounding words, and these words can be used as suggestions in the suggestion list of spell checkers. These weights can also be used to find similar sounding words in the database.
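The uses listed in (a) and (b) can be illustrated with a short Python sketch that groups stored words by their phonetic weight and looks up similar sounding words for a query. The sketch is hypothetical: the names weight, build_index and similar are not from the specification, the weight function is a compact software stand-in for the IGU output (with "0" padding assumed), and the sample words are illustrative Latin transliterations.

```python
from collections import defaultdict

GROUPS = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
CODE = {letter: digit for letters, digit in GROUPS.items() for letter in letters}


def weight(word):
    """Compact stand-in for the 'letter, digit, digit, digit' phonetic weight."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    kept = [c for c in letters[1:] if c in CODE]
    collapsed = [c for i, c in enumerate(kept) if i == 0 or c != kept[i - 1]]
    digits = "".join(CODE[c] for c in collapsed)[:3]
    return letters[0] + digits.ljust(3, "0")


def build_index(words):
    """Index the stored words by phonetic weight."""
    index = defaultdict(list)
    for word in words:
        index[weight(word)].append(word)
    return index


def similar(index, query):
    """Return the stored words whose weight equals that of the query."""
    return index.get(weight(query), [])


stored = ["Pune", "Poona", "Nagpur", "Nagpoor", "Mumbai"]
index = build_index(stored)
print(similar(index, "Puna"))          # candidate list for a search or spell checker
print(sorted(stored, key=weight))      # sorting by phonetic weight
```

Only the words that share the weight of the query need to be examined, which is how the weights narrow the scope of a search or produce a suggestion list.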
In this context, the present invention has a distinct advantage over the existing systems. The Soundex method has the severe limitation that it can process only inputs from the English language; inputs from non-English languages such as French, German or Italian cannot be processed by the Soundex method. Sanskrit, which is expressed through the Devanagari script, is considered to be the most scientific language, capable of incorporating different pronunciations and/or nuances of pronunciation of non-English languages. The present invention of the petitioners has a distinct advantage in that, for all non-English languages, the use of a transliterator will result in ISCII inputs for the Microprocessor-Based System (2). The reprocessable signals can thus be obtained even for non-English languages.

We claim,
1. A method for performing a phoneme-based character transformation of textual information based on locale-specific information, comprising:
i) providing a phonetic transformation structure comprising multidimensional arrays storing language-specific phonetic structures mapped to a basic Latin phonetic structure in a microprocessor-based processing unit;
ii) receiving input initiating a phonic character transformation to be performed on an input character and/or character sequence, said input character/character sequence comprising Latin/non-Latin character(s) having at least one byte or 7-bit value;
iii) mapping said input character/character sequence into the multidimensional array storing the language-specific information;
iv) transforming the said Latin phonetic character/character sequence into phonic transformation values for said input character/character sequence for use by known methods.

2. A method as claimed in claim 1, wherein a glyph-based encoding is converted to an ISCII and/or UNICODE encoding standard, comprising:
I. searching the data based on the phonic weights generated using the methodology described above in Claim 1; and
II. sorting the data based on the phonic weights generated using the methodology described above in Claim 1; and
III. indexing of data based on the phonic weights generated using the methodology described above in Claim 1.
3. A method as claimed in Claim 1, wherein the phonic weights generated are used to perform spelling correction(s) and/or generate a list of suggestions.
4. A method as claimed in claims 1, 2 and 3, wherein phonic weights are generated for a variety of non-English languages.
Dated this the 26th day of February, 2003


Documents:

73-mum-2002-cancelled pages(14-1-2005).pdf

73-mum-2002-claims(granted)-(14-1-2005).doc

73-mum-2002-claims(granted)-(14-1-2005).pdf

73-mum-2002-correspondence(13-1-2005).pdf

73-mum-2002-correspondence(ipo)-(26-2-2007).pdf

73-mum-2002-drawings(25-2-2003).pdf

73-mum-2002-form 1(28-1-2002).pdf

73-mum-2002-form 19(27-2-2003).pdf

73-mum-2002-form 2(granted)-(14-1-2005).doc

73-mum-2002-form 2(granted)-(14-1-2005).pdf

73-mum-2002-form 26(28-1-2002).pdf

73-mum-2002-form 3(28-1-2002).pdf

73-mum-2002-form 4(27-2-2003).pdf

73-mum-2002-form 5(14-1-2005).pdf

73-mum-2002-other documents(28-1-2002).pdf

73-mum-2002-petition under rule 137(14-1-2005).pdf

abstract1.jpg


Patent Number 204545
Indian Patent Application Number 73/MUM/2002
PG Journal Number 41/2008
Publication Date 10-Oct-2008
Grant Date 26-Feb-2007
Date of Filing 28-Jan-2002
Name of Patentee CENTRE FOR DEVELOPMENT OF ADVANCED COMPUTING (CDAC)
Applicant Address PUNE UNIVERSITY CAMPUS, GANESHKHIND, PUNE 411 007, MAHARASHTRA.
Inventors:
# Inventor's Name Inventor's Address
1 MR. ADITYA ANIL GOKHALE AN INDIAN CITIZEN, RESIDING AT D2, NIRANT COLONY, KOTHARI BLOCKS, BIBWEWADI, PUNE-411 037.
PCT International Classification Number G 06 F 17/60
PCT International Application Number N/A
PCT International Filing date
PCT Conventions:
# PCT Application Number Date of Convention Priority Country
1 NA