Title of Invention | SEARCH METHOD AND SEARCH SYSTEM USING COMPOUND WORDS |
---|---|
Abstract | Embodiments of the present invention provide a method for creating an index. The method includes: obtaining valid entries from at least one page; determining at least one compound word, each of which is a combination of at least two of the obtained valid entries; and creating a page index for each compound word. Embodiments of the present invention further provide a search method, an apparatus for creating an index and a search system. In embodiments of the present invention, combination entries that appear with high frequency in the pages are identified statistically, and an index is created for each of them. The search word is therefore segmented into fewer, larger search entries when a search is performed, and thus the number of index queries, intersection operations and union operations performed by the search engine is reduced. |
Full Text | Method and Apparatus for Creating Index and Search Method and Search System
Field of the Invention
The present invention relates to the field of computer technologies, and more particularly, to a method and apparatus for creating an index, as well as a search method and a search system.
Background of the Invention
With the rapid development of the Internet, the amount of information is growing explosively. For a user, finding information in such an ocean of information is as difficult as looking for a needle in a haystack. Every network user is confronted with the problem of information overload and cannot find the needed information accurately. The search engine was developed to solve this problem. The navigation service provided by the search engine is a very important network service on the Internet and has become, like e-mail, one of the most important Internet applications. The search engine provides the user with an information search service, and classifies information on the Internet by using a spider program so as to help the user find the needed information among masses of Internet information.
The search engine works in three main steps: first, downloading pages from the Internet; second, creating an index database; third, searching the index database and sequencing the results.
Currently, when the index database is created, generally only single words (i.e. unitary words) in the page are indexed. Thus, when processing a user request, the search engine segments the user's search word into several words (word segmentation) and performs an index query for each segment to obtain a search result for each segment. For example, when the user searches for "Peking gymnasium", the search engine performs the following actions: first, it segments "Peking gymnasium" into two words, i.e. "Peking" and "gymnasium"; second, it performs the index query for the word "Peking" and obtains a result set A; third, it performs the index query for the word "gymnasium" and obtains a result set B; fourth, it performs an intersection operation on the result sets A and B and obtains their intersection X; fifth, it performs a union operation on the result sets A and B and obtains their union Y; and sixth, it outputs the search results to the user. The search results are sequenced as follows: pages in the intersection X are arranged first, and pages in the union Y but not in the intersection X are arranged after them.
As another example, when the search word is "China people's bank", the search word is segmented into three words, "China", "people's" and "bank", and the index query is performed three times. If the intersection operation and the union operation are performed for every pair of words, each operation is performed three times before the final search result is obtained.
The disadvantages of indexing only single words in the page include: small segmentation granularity of the search word, a large number of index queries and set operations performed by the search engine, low query efficiency of the system, and low search speed.
Currently, some search engines index binary words in the page while creating the index database. However, in this process, indexes of many meaningless combinations are also created, which wastes space. For example, in an existing binary word index, an index is created for each binary combination regardless of the logical relationship between the words.
Taking "I see you are there" as an example, the binary words are "I see", "see you", "you are", "are there", etc. Many of these combinations, such as "you are", are meaningless, so the user does not enjoy a good search experience. Moreover, the index space expands rapidly, which leads to an excessively large index.
Summary of the Invention
An embodiment of the present invention provides a method for creating an index, including: obtaining valid entries from at least one page; determining at least one compound word, each of which is a combination of at least two of the obtained valid entries; and creating a page index for each compound word.
Another embodiment of the present invention provides a search method, including: creating a page index for at least one compound word which is a combination of at least two valid entries obtained from at least one page; segmenting a search word into at least one compound word; and finding the page index created for each compound word obtained by segmenting the search word.
Another embodiment of the present invention further provides an apparatus for creating an index, including: a first module, configured to obtain valid entries from at least one page, and determine at least one compound word, each of which is a combination of at least two of the valid entries obtained; and a second module, configured to create a page index for each compound word determined by the first module.
Another embodiment of the present invention further provides a search system, including: a first module, configured to create a page index for at least one compound word, which is a combination of at least two valid entries obtained from at least one page; and a second module, configured to segment a search word into at least one compound word and to find, according to the page index created by the first module for the at least one compound word, the page index created for each compound word obtained by segmenting the search word.
In embodiments of the present invention, combination entries that appear with high frequency in the pages are identified statistically, and an index is created for each of them. The search word is segmented into fewer search entries when a search is performed, and thus the number of index queries, intersection operations and union operations performed by the search engine is reduced, the search speed of the search engine is greatly enhanced, the objective of responding rapidly to the user is achieved, and the user experience is improved. At the same time, since indexes are selectively created for multi-unit words via probability statistics, the utilization rate of the index database and the accuracy of the search performed by the system are increased.
Brief Description of the Drawings
Figure 1 is a schematic diagram illustrating a structure of a search system in accordance with an embodiment of the present invention.
Figure 2 is a flowchart illustrating a process of creating an index database in a search method in accordance with an embodiment of the present invention.
Figure 3 is a flowchart illustrating a process after receiving a search request in a search method in accordance with an embodiment of the present invention.
Detailed Description of the Invention
The present invention will be described in detail hereinafter with reference to the accompanying drawings and specific embodiments.
Figure 1 is a schematic diagram illustrating a structure of a search system in accordance with an embodiment of the present invention. As illustrated in Figure 1, the search system 10 includes a page downloading module 100, a page database 200, an index module 300, an index database 400 and a search module 500, which are connected sequentially.
The page downloading module 100 is configured to automatically retrieve information from the Internet, and save the retrieved information in the page database 200. Generally, the page downloading module 100 visits the Internet via a network spider program capable of automatically collecting pages from the Internet, and repeatedly jumps to other pages via each Uniform Resource Locator (URL) in the current page. The page downloading module 100 then saves all the pages it has traversed in the page database 200.
Automatic information search functions of the search engine are classified into two types: periodic search, and submitting a website for search. In the periodic search, the page downloading module 100 actively runs the spider program at regular intervals (e.g. every 28 days) to search Internet websites within a certain IP address scope; once a new website is found, the spider program automatically retrieves the information and address of the new website and adds them to the page database 200.
When a website is submitted for search, the owner of the website actively submits the address of the website to the search engine; during a period of time (such as two days or several months), the page downloading module 100 of the search engine controls the spider program to scan the website at that address at regular intervals and to save relevant information in the page database 200.
The page database 200 is configured to save all the pages obtained by the page downloading module 100, for use in the searches performed by the user.
The index module 300 is configured to analyze the pages saved in the page database 200, retrieve relevant page information (including the URL of a page, coding type, keywords contained in the contents of the page, locations of the keywords, generation time, size, and link relationship with other pages), perform extensive calculations according to a correlation algorithm to obtain the correlation (or significance) between each entry and the page contents as well as the hyperlinks of each page, create an entry index from the information obtained, and save the created entry index in the index database 400.
In this embodiment, the index module 300 includes a document preprocessing unit 301, a word segmentation unit 302, a word frequency statistics unit 303 and an index creating unit 304.
The document preprocessing unit 301 is configured to read a page from the page database 200, convert the different data formats in the inputted page into standard data formats, e.g. convert an HTML page, e-mail or PDF file into a text file, filter out script identifiers and useless advertisement information at the same time, and then output the processed data to the word segmentation unit 302.
The word segmentation unit 302 is configured to perform word segmentation on the page contents whose formats have been converted.
In order to enhance the efficiency of the system, stop words and function words are removed before the word segmentation (they may, of course, also be removed after the word segmentation) so that only valid entries are left. In this embodiment, the word segmentation unit 302 is configured to segment the converted page text and title into words according to a dictionary. For example, "I see um you are there" is segmented into five valid entries "I", "see", "you", "are" and "there" after removing the stop word "um".
Existing word segmentation algorithms may be classified into three types: word segmentation based on character-string matching, word segmentation based on comprehension, and word segmentation based on statistics. In this embodiment, the word segmentation algorithm based on character-string matching is adopted, which is also called a mechanical word segmentation algorithm. This algorithm matches, according to a certain strategy, the Chinese character string to be analyzed against the entries in a sufficiently large machine dictionary. If a certain character string is found in the machine dictionary, the match succeeds (i.e. one word is recognized).
The word frequency statistics unit 303 is configured to calculate the word frequency, which lays the foundation for creating a compound-word index. A compound word, as its name implies, is a combination entry (i.e. at least a binary entry) formed by at least two words (i.e. entries) and is an entry having a definite meaning or relationship. For example, "eat apple" is a compound word formed by the two entries "eat" and "apple". As another example, "Bank of China" and "Ceramic Sand" are each compound words formed by two entries. The word frequency of an entry refers to the number of times the entry appears in a text.
For example, if an entry appears in a text thirty times, the word frequency of the entry is thirty.
The word frequency statistics unit 303 first forms various combinations of the valid entries outputted from the word segmentation unit 302. For example, the entries of the phrase "international stratagem choice and domestic stratagem arrangement of China intellectual property" are combined into "China intellectual", "intellectual property", "China intellectual property", "property international", "international stratagem", "stratagem choice", etc. The word frequency of each combination entry is calculated in the original text of the page. After the word frequency of all the combination entries has been calculated, the combination entries are arranged according to word frequency. A combination entry with a word frequency larger than a threshold value is regarded as a compound word and is transmitted to the index creating unit 304. Compound words obtained statistically in this way are very close to actual compound words, and good results can be achieved without manual intervention. The compound words can, of course, also be determined in other manners, for example, by taking combination words commonly used in daily life as compound words.
The index creating unit 304 is configured to create indexes for all the valid entries outputted from the word segmentation unit 302 and for the compound words outputted from the word frequency statistics unit 303, and save the created indexes in the index database 400. Alternatively, the index creating unit 304 may be configured to create an index only for each valid entry unable to form a compound word and for the compound words outputted from the word frequency statistics unit 303. The index creating unit 304 is further configured to send the compound words outputted from the word frequency statistics unit 303 to the index database 400.
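The word frequency statistics and index creation described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the implementation of units 303 and 304: whitespace splitting with a small hypothetical stop-word list stands in for dictionary-based word segmentation, only runs of adjacent valid entries are combined, frequencies are counted over the whole page set, and the names `valid_entries`, `adjacent_combinations`, `find_compound_words` and `build_index` are invented for the example.

```python
from collections import Counter, defaultdict

# Hypothetical stop/function word list; a real system would use a dictionary.
STOP_WORDS = {"um", "of", "the", "a", "an", "and"}


def valid_entries(text):
    """Split on whitespace and drop stop words (stand-in for word segmentation)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def adjacent_combinations(entries, max_len=3):
    """Yield every run of 2..max_len adjacent valid entries as a tuple."""
    for n in range(2, max_len + 1):
        for i in range(len(entries) - n + 1):
            yield tuple(entries[i:i + n])


def find_compound_words(pages, threshold, max_len=3):
    """Return the combinations whose frequency is larger than the threshold.

    Assumption: frequencies are summed over all pages in `pages`.
    """
    counts = Counter()
    for text in pages.values():
        counts.update(adjacent_combinations(valid_entries(text), max_len))
    return {combo for combo, freq in counts.items() if freq > threshold}


def build_index(pages, compound_words, max_len=3):
    """Map each valid entry and each compound word to the set of page ids."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        entries = valid_entries(text)
        for entry in entries:
            index[(entry,)].add(page_id)  # single entries are 1-tuples
        for combo in adjacent_combinations(entries, max_len):
            if combo in compound_words:
                index[combo].add(page_id)
    return dict(index)
```

With a threshold of 1, a combination must appear at least twice in the page set before it is treated as a compound word and given its own index entry; combinations that appear only once are never indexed, which is what keeps the index from growing with every possible pairing.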
The index database 400 is configured to save all the compound words received in a compound word table (not shown in Figure 1).
The search module 500 is configured to segment a search word after the user inputs a search request containing the search word, find all pages relevant to the search word in the index database 400, and send the relevant pages to the user after calculating their correlation and sequencing them. The search module 500 includes a search word segmentation unit 501, a search unit 502 and a result processing unit 503.
The search word segmentation unit 501 is configured to segment the search word according to the valid entries and the compound word table in the index database 400, and send the search entries obtained by the word segmentation to the search unit 502. For example, if the search word is "China people's bank", the valid entries are "China", "people's" and "bank". If "China people's" exists in the compound word table while neither "China bank" nor "people's bank" does, the search word is segmented into two search entries, "China people's" and "bank". If "China people's", "China bank" and "people's bank" all exist in the compound word table, the search word is segmented into "China people's", "China bank" and "people's bank". If "China people's bank" also exists in the compound word table, "China people's bank" is directly taken as the search entry.
The search unit 502 searches the index database 400 for the search entries obtained by the segmentation, retrieves the pages meeting the requirements and sends the retrieved pages to the result processing unit 503.
The result processing unit 503 performs the intersection operation and the union operation on the pages received, obtains a result page set, calculates the correlation between the pages and the search entries, and returns the first K pages according to the correlation values. K is a natural number, and the links of the K pages are placed in one result page.
If the user wishes to view a second result page, the links of the pages ranked from K+1 to 2*K in the sequencing result are placed in the second result page and returned to the user. In other embodiments of the present invention, all the pages found may be sent to the user at one time. In other embodiments of the present invention, the pages corresponding to the compound words contained in the search word inputted by the user are arranged at the very front.
For a better understanding of the search system 10 according to the embodiments of the present invention, it should be noted that link information is retrieved at the same time as the indexes are created, i.e. link information of the pages (including information such as anchor texts and links) is saved in a link database (not shown in Figure 1). According to the link information saved in the link database, page ranking is performed by a page ranking module (not shown in Figure 1). When the user performs a search, the search module 500 searches the index database 400 for relevant pages; at the same time, the page ranking module evaluates the correlation of the search results by combining the search request with the link information. The search module 500 sequences the pages according to the correlation, retrieves a content abstract for the search entries, and organizes and returns the pages to the user.
For example, when the user inputs the search word "China people's bank", the system segments the search word into "China people's" and "bank", performs the index query twice, and returns the search result to the user after performing the intersection operation and the union operation once each. Compared with the conventional method, the number of intersection and union operations is reduced, and the search speed is increased.
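The segmentation, query and sequencing behaviour described above for the search module 500 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the patented implementation: the function names `segment_search_word` and `search` are hypothetical, compound words are represented as tuples of valid entries, the index is a plain dictionary from such tuples to sets of page ids, and sorting by page id stands in for the correlation ranking, which the text does not specify.

```python
from itertools import combinations


def segment_search_word(entries, compound_table):
    """Segment valid entries into search entries using a compound word table.

    Compound words are preferred; a compound word contained in another
    matched compound word is dropped; entries that form no compound word
    are kept as single-entry search entries.
    """
    positions = range(len(entries))
    # Every ordered selection of two or more entries found in the table.
    matches = [
        idx
        for n in range(2, len(entries) + 1)
        for idx in combinations(positions, n)
        if tuple(entries[i] for i in idx) in compound_table
    ]
    # Keep only matches not strictly contained in another match.
    maximal = [m for m in matches if not any(set(m) < set(o) for o in matches)]
    covered = {i for m in maximal for i in m}
    search_entries = [tuple(entries[i] for i in m) for m in maximal]
    search_entries += [(entries[i],) for i in positions if i not in covered]
    return search_entries


def search(search_entries, index, k=10, page=1):
    """Query the index once per search entry and return one page of results.

    Pages in the intersection of all result sets come first, followed by
    pages that are in the union but not in the intersection.
    """
    result_sets = [index.get(entry, set()) for entry in search_entries]
    if not result_sets:
        return []
    in_all = set.intersection(*result_sets)
    in_any = set.union(*result_sets)
    # Assumption: page id order replaces the unspecified correlation score.
    ranked = sorted(in_all) + sorted(in_any - in_all)
    return ranked[(page - 1) * k:page * k]
```

For the search word "China people's bank" with only "China people's" in the compound word table, `segment_search_word` yields two search entries, so `search` performs two index lookups and one intersection and one union, instead of three lookups under single-word indexing.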
Figure 2 is a flowchart illustrating a process of creating an index database in a search method in accordance with an embodiment of the present invention. In the search method according to an embodiment of the present invention, the process of creating or updating the index database 400 includes:
Block S11: A page is read, and the text of the page is converted into a standard data format. Irrelevant information, such as script identifiers and advertisement information, is filtered out.
Block S12: Word segmentation is performed after removing stop words and function words.
Block S13: Word frequency statistics are calculated for various combinations of the valid entries obtained by the word segmentation.
Block S14: Combination entries with a word frequency larger than a threshold value are taken as compound words and are outputted.
Block S15: Indexes are created and saved for the compound words with a word frequency larger than the threshold value and for all the valid entries obtained by the word segmentation.
In the embodiment of the present invention, after the indexes are created for the compound words, the created indexes may be periodically updated. For example, a new compound word is added and an index is created for it; the page information in the indexes of existing compound words is updated; or a compound word and the index created for it are removed. A new compound word may be added when the number of times that a combination of valid entries appears in the pages rises from below the threshold value to above it. A compound word may be removed when the number of times that the compound word appears in the pages falls from above the threshold value to below it.
An embodiment of the present invention also provides a search method. After a search word inputted by a user is received, the following procedure is performed.
The search word is segmented according to the valid entries and a compound word table, and at least one search entry is obtained. When the search word is segmented, it is first segmented into compound words. Valid entries which do not form a compound word are taken as search entries directly. When the search word can be segmented into multiple compound words and one of them includes all the information of a second compound word, the second compound word is not taken as a search entry. In other words, a compound word taken as a search entry is not contained in any other compound word. For example, when the search word itself is in the compound word table, the search word is taken as the search entry directly.
The index query in the index database is performed for the at least one search entry, and at least one result set is obtained.
The at least one result set is returned to the user. The at least one result set may be sequenced before being returned to the user, as follows: the intersection of all the result sets is arranged first, and the pages in the union of all the result sets that are not in the intersection are arranged after it.
The search word "China people's bank" is taken as an example to describe the above procedure. Figure 3 is a flowchart illustrating a process after receiving a search request in a search method in accordance with an embodiment of the present invention. As illustrated in Figure 3:
Block S21: The search word is segmented according to a compound word table, and "China people's" and "bank" are obtained.
Block S22: An index query in an index database is performed for "China people's", and a result set R1 is obtained. The index query is performed for "bank", and a result set R2 is obtained.
Block S23: An intersection operation is performed on the result sets R1 and R2, and a set R3 is obtained.
Block S24: A union operation is performed on the result sets R1 and R2, and a set R4 is obtained.
Block S25: The results are returned to the user after being sequenced. Pages in the set R3 are arranged first, and pages in the set R4 but not in the set R3 are arranged after the pages in the set R3.
In other embodiments of the present invention, the search and the segmentation of the compound word may be performed simultaneously, so that a general and complete result can be obtained.
The foregoing are only embodiments of the present invention. The protection scope of the present invention, however, is not limited to the above description. Any change or substitution readily occurring to those skilled in the art should be covered by the protection scope of the present invention.
We claim:
1. A method for creating an index, comprising: obtaining valid entries from at least one page; determining at least one compound word, each of which is a combination of at least two valid entries of the valid entries obtained; and creating a page index for each compound word.
2. The method according to claim 1, comprising: creating a page index for each valid entry obtained from the at least one page; or creating a page index for each valid entry unable to form a compound word.
3. The method according to claim 1, comprising: updating the page index created for said each compound word.
4. The method according to claim 1, comprising: adding at least one compound word and creating a page index for each added compound word; and/or removing at least one compound word and removing a page index created for each removed compound word.
5. The method according to claim 1, wherein said determining the at least one compound word comprises: calculating the number of times that various combinations of at least two valid entries appear in a page; and determining a combination of valid entries with the number of times that the combination appears larger than a threshold value as a compound word.
6. The method according to claim 5, comprising: adding a compound word if the number of times that one of the various combinations of the at least two valid entries appears in a page changes as larger than the threshold value from lower than the threshold value, and creating a page index for the added compound word which is said one of the various combinations of the at least two valid entries; and/or removing a compound word when the number of times that the removed compound word appears in the page changes as lower than the threshold value from larger than the threshold value, and removing a page index created for the removed compound word.
7. A search method, comprising: creating a page index for at least one compound word which is a combination of at least two valid entries of valid entries obtained from at least one page; segmenting a search word to at least one compound word; and finding a page index created for each of the at least one compound word obtained by segmenting the search word.
8. The method according to claim 7, comprising: when creating the page index for the at least one compound word, creating a page index for each valid entry obtained from the at least one page; or creating a page index for each valid entry unable to form a compound word.
9. The method according to claim 8, comprising: if the search word comprises at least one valid entry unable to form a compound word when segmenting the search word, finding a page index created for each valid entry of the search word unable to form a compound word.
10. The method according to claim 7, wherein said segmenting the search word to the at least one compound word comprises: when the search word is segmented to more than one compound word, segmenting the search word to compound words which are not contained in any other compound word.
11. An apparatus for creating an index, comprising: a first module, configured to obtain valid entries from at least one page, and determine at least one compound word, each of which is a combination of at least two valid entries of the valid entries obtained; and a second module, configured to create a page index for each compound word determined by the first module.
12. The apparatus according to claim 11, wherein the second module is configured to create a page index for each valid entry obtained from the at least one page, or create a page index for each valid entry unable to form a compound word.
13. The apparatus according to claim 11, wherein the first module comprises: a first unit, configured to obtain the valid entries from the at least one page; and a second unit, configured to determine the at least one compound word, each of which is the combination of the at least two valid entries of the valid entries obtained.
14. The apparatus according to claim 11, wherein the first module comprises: a third unit, configured to obtain the valid entries from the at least one page; and a fourth unit, configured to calculate the number of times that the combination of the at least two valid entries of the valid entries obtained appears in a page, and determine the combination with the number of times that the combination appears larger than a threshold value as a compound word.
15. A search system, comprising: a first module, configured to create a page index for at least one compound word, which is a combination of at least two valid entries of valid entries obtained from at least one page; and a second module, configured to segment a search word to at least one compound word, find, according to the page index created by the first module for the at least one compound word, a page index created for each compound word obtained by segmenting the search word.
16. The search system according to claim 15, comprising: a third module, configured to save the at least one compound word from the first module and the page index created for each compound word; wherein the second module is configured to find the page index created for said each compound word obtained by segmenting the search word from the third module.
17. The system according to claim 15, wherein the first module comprises: a first unit, configured to obtain the valid entries from the at least one page, and determine the at least one compound word, each of which is the combination of the at least two valid entries of the valid entries obtained; and a second unit, configured to create the page index for said each compound word determined by the first unit.
18. The system according to claim 17, wherein the first unit comprises: a first subunit, configured to obtain the valid entries from the at least one page; and a second subunit, configured to determine the at least one compound word, each of which is the combination of the at least two valid entries of the valid entries obtained by the first subunit.
19. The system according to any of claims 15 to 18, wherein the second module comprises: a third unit, configured to segment the search word to the at least one compound word according to the at least one compound word from the first module; and a fourth unit, configured to receive each compound word sent from the third unit, find the page index created for said each compound word according to the page index created by the first module.
20. The system according to claim 19, wherein the second module comprises: a fifth unit, configured to return a page link of the page index found by the fourth unit to a user.
21. The system according to claim 15, wherein the first module is configured to create a page index for each valid entry obtained from the at least one page, or create a page index for each valid entry unable to form a compound word; and the second module is configured to find a page index created for each valid entry of the search word unable to form a compound word when the search word comprises at least one valid entry unable to form a compound word. |
---|
Patent Number | 279763 |
---|---|
Indian Patent Application Number | 3974/CHENP/2009 |
PG Journal Number | 05/2017 |
Publication Date | 03-Feb-2017 |
Grant Date | 30-Jan-2017 |
Date of Filing | 07-Jul-2009 |
Name of Patentee | SHENZHEN SHI JI GUANG SU INFORMATION TECHNOLOGY CO LTD |
Applicant Address | F16 TENCENT BUILDING KEJIZHONGYI AVENUE YUEHAI STREET NANSHAN DISTRICT SHENZHEN-518057 |
Inventors:
|
|||||||||||||
PCT International Classification Number | G06F17/30 |
PCT International Application Number | PCT/CN08/70253 |
PCT International Filing date | 2008-02-02 |
PCT Conventions:
|