Title of Invention

"CUT AND PASTE DOCUMENT SUMMARIZATION SYSTEM AND METHOD"

Abstract

A system for generating a summary of an input document, comprising an extraction module for receiving the input document and extracting at least one sentence related to a focus of the document; a summary sentence generation module operatively coupled to the extraction module; a grammatical parser operatively coupled to the generation module for parsing the extracted sentences into components in a grammatical representation; a combined lexicon operatively coupled to the generation module; and a corpus of human generated summaries operatively coupled to the generation module.
Full Text

The present invention relates to a system for generating a summary of an input document.

Statement of Government Rights
The United States Government may have certain rights to the invention set forth herein pursuant to a grant by the National Science Foundation, Contract No. IRI-96-198124.
Statement of Related Applications
This application claims the benefit of United States provisional patent application, Serial No. 60/120,657, entitled "Summary Generation Through Intelligent Cutting and Pasting of the Input Document" which was filed on February 19, 1999.
Field of the Invention
The present invention relates generally to information summarization and more particularly relates to systems and methods for generating a summary of a document using automated cutting and pasting of the input document.
Background of the Invention
The amount of information available today drastically exceeds that of any other time in history. With the continuing expansion of the Internet, this trend will likely continue well into the future. Often, people conducting research of a topic are faced with information overload as the number of potentially relevant documents exceeds the researcher's ability to individually review each document. To address this problem, information summaries are often relied on by researchers to quickly evaluate a document to determine if it is truly relevant to the problem at hand.
Given the vast collection of documents available, there is interest in developing and improving the systems and methods used to summarize information content. For individual documents, domain-dependent template-based systems and domain-independent sentence extraction methods are known. Such known systems can provide a reasonable summary of a single document when the domain is known.
Many presently available summarizers extract sentences from the original documents to produce summaries. However, since the sentences are generally extracted without supporting context information, the resulting summaries can be incoherent, and in some cases, can convey misleading information.
Therefore, there remains a need for systems and methods which can generate a more readable and concise summary of a document.
Summary of the Invention
It is an object of the present invention to provide a system and method for generating a summary of a document.
It is another object of the present invention to provide a summarization system which extracts sentences from an input document and then transforms the extracted sentences such that a concise, coherent and accurate summary results.
It is a further object of the present invention to provide a system and method for generating a summary of a document which use automated cutting and pasting of the input document.
A present method for generating a summary of an input document includes extracting at least one sentence from the document. The extracted sentences are parsed into components, preferably in a parse tree representation. Sentence reduction is performed to mark components which can be removed from the extracted sentences. Sentence combination is performed to mark components of two or more sentences which can be merged. Sentence combination also includes a paste operation to operate on the marked components to effect the indicated removal and combination of sentence components.
A preferred sentence reduction operation includes measuring the contextual importance of the components; measuring the probabilistic importance of the components based on a given corpus; measuring the importance of the components based on linguistic knowledge; synthesizing the contextual, probabilistic and knowledge based importance measures into a relative importance score for each component; and marking for removal those components with an importance score below a threshold value.
The contextual importance can be measured by establishing a plurality of lexical links of at least one type among the components in a local context in the document and computing a context importance score according to the type, number and direction of lexical links associated with each component. The types of lexical links can include repetition, inflectional variants, derivational variants, synonyms, hypernyms, antonyms, part-of, entailment, and causative links.
In a preferred method, the sentence combination operation includes identifying sentence combination operations from a sentence combination subcorpus and developing rules regarding the application of the sentence combination operations. The combination rules are then applied to the extracted sentences after sentence reduction to identify and merge suitable sentences from the original article. The sentence combination operations can be selected from the group including add descriptions, aggregations, substitute incoherent phrases, substitute phrases with more general or more specific information, and mixed operations.
A present system for generating a summary of an input document includes an extraction module which receives the input document and extracts at least one sentence related to a focus of the document. A summary sentence generation module is provided, which generally includes a sentence reduction module and a sentence combination module. The system includes a grammatical parser operatively coupled to the generation module for parsing the extracted sentences into components in a grammatical representation. A combined lexicon and a corpus of human generated summaries are operatively coupled to the generation module for use by the operational modules during summary generation.
The corpus can further include a sentence generation subcorpus and a sentence reduction subcorpus. The subcorpora can be generated manually or through the use of a decomposition module.
Preferably, the sentence reduction module is cooperatively engaged with the combined lexicon and performs context importance processing on the components of the grammatical representation. Context importance processing can
include establishing a plurality of lexical links of at least one type for the components and generating a context importance score based on the type and number of links associated with the components. The number and type of lexical links can vary; however, a preferred set of lexical link types includes repetition, inflectional variants, derivational variants, synonyms, hypernyms, antonyms, part-of, entailment, and causative links.
Preferably, the sentence reduction module further computes the relative importance of the components based on linguistic knowledge stored in the combined lexicon. The sentence reduction module can also be cooperatively engaged with the corpus and perform probabilistic importance processing on the components of the grammatical representation in accordance with the particular corpus used.
The sentence combination module can be used to identify sentence combination operations from a sentence combination subcorpus and develop rules regarding the application of the sentence combination operations. The combination module applies the combination rules to the extracted sentences after sentence reduction to identify and merge suitable sentences from the original article.
A decomposition module in accordance with the present application can be used to evaluate human generated summaries and map corresponding portions of the summaries to the original documents. The decomposition module indexes words in the summary and the original document. A Hidden Markov Model is then built based on heuristic rules to determine the probability of phrases in the summary sentence matching a given phrase in the original document. A Viterbi algorithm can then be employed to determine the best solution for the Hidden Markov Model and generate a mapping between summary phrases and the original document. This mapping can be used to generate, among other things, a sentence reduction subcorpus and a sentence combination subcorpus. Such a decomposition module can be operatively coupled to the corpus in the summary generation system described above.

Statement of the Invention
The present invention relates to a system for generating a summary of an input document, said system comprising:
a) an input unit able to input documents, said input unit contained in a plurality of client devices;
b) a plurality of servers connected to said plurality of client devices and storing the input documents and a combined lexicon;
c) a central processing unit contained in said plurality of servers and communicatively connected to the input unit, said central processing unit having an extraction unit, a summary sentence generation unit and a grammatical parser unit;
d) said extraction unit connected to the input unit and receiving the input documents and extracting at least one sentence related to a focus of the document;
e) said summary sentence generation unit operatively coupled to said extraction unit;
f) said grammatical parser operatively coupled to the generation unit for parsing the extracted sentences into components in a grammatical representation;
g) said combined lexicon and a corpus of human generated summaries operatively coupled to the generation unit;
h) a communication network connected to said plurality of servers and said plurality of client devices; and
i) a user interface means located in said client device providing a user interface by connecting said client device with the communications network, and a connection means for verifying the integrity of the client device connection with the communication network.

Brief Description of the Drawings
Further objects, features and advantages of the invention will become apparent from the following detailed description taken in conjunction with the accompanying figures showing illustrative embodiments of the invention, in which
Figure 1 is a block diagram of the system architecture of the present document summarization system;
Figure 2 is a flow chart illustrating an exemplary embodiment of a sentence reduction operation in accordance with the summarization system of Figure 1;
Figure 3 is a pictorial diagram of an exemplary parse tree sentence representation;
Figure 4 is a flow chart illustrating an exemplary embodiment of a sentence combination operation in accordance with the present summarization system of Figure 1;
Figure 5 is a table illustrating exemplary sentence combination operations for the sentence combination operation of Figure 4;
Figure 6 is a table illustrating exemplary sentence combination rules for applying the sentence combination operations of Figure 5;
Figure 7 is a flow diagram illustrating the operation of the corpus decomposition module of Figure 1; and
Figure 8 is a pictorial diagram of a Hidden Markov Model for use in a corpus decomposition module.
Throughout the figures, the same reference numerals and characters, unless otherwise stated, are used to denote like features, elements, components or portions of the illustrated embodiments. Moreover, while the subject invention will now be described in detail with reference to the figures, it is done so in connection with the illustrative embodiments. It is intended that changes and modifications can be made to the described embodiments without departing from the true scope and spirit of the subject invention as defined by the appended claims.

Detailed Description of Preferred Embodiments
The present summarization systems and methods generate a generic, domain-independent single-document summary of a received input document. Figure 1 is a block diagram illustrating the system architecture of an exemplary embodiment of the present summarization system. Such a system can be implemented on various computer hardware, software and operating system platforms. The particular system components selected are not critical to the practice of the present invention. For example, the present system of Figure 1 can be implemented on a personal computer system, such as an IBM compatible system.
Referring to Figure 1, an input document 105 in computer readable form is applied to an extraction module 110 which determines the focus of the document 105 and extracts sentences from the document accordingly. A number of extraction techniques can be used in the extraction module 110. In a preferred embodiment, the extraction module 110 links words in a sentence to other words in the input document 105 through repetitions, morphological relations and lexical relations. An importance score can then be computed for each word in the article 105 based on the number, type and direction (forward, backward) of the lexical links associated with the word. A sentence score can be determined by adding the importance score for each of the words in the sentence and normalizing the sum based on the number of words in the sentence. The sentences can then be extracted based on the highest relative sentence scores.
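As a rough illustration of the extraction scoring just described, the following sketch counts only simple repetition links (a real link detector would also use morphological and lexical relations); all function names are hypothetical, not taken from the patent.

```python
def word_scores(sentences, link_weight=1.0):
    """Score each word by counting its occurrences in the other sentences
    (repetition links only, as a simplification)."""
    scores = []
    for i, sent in enumerate(sentences):
        sent_scores = {}
        for word in sent:
            links = sum(other.count(word)
                        for j, other in enumerate(sentences) if j != i)
            sent_scores[word] = link_weight * links
        scores.append(sent_scores)
    return scores

def sentence_scores(sentences):
    """Sum the word scores and normalize by sentence length."""
    ws = word_scores(sentences)
    return [sum(s.values()) / len(sent) if sent else 0.0
            for s, sent in zip(ws, sentences)]

def extract(sentences, k=1):
    """Return the k highest-scoring sentences, preserving document order."""
    scores = sentence_scores(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]
```

Normalizing by sentence length, as the text notes, keeps long sentences from dominating purely by word count.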
The extraction module 110 provides the extracted sentences 115 to a generation module 120. The generation module 120 also receives the original document 105 as an input. The generation module 120 further includes a sentence reduction module 135 and a sentence combination module 140. The sentence reduction module 135 provides a marked up parse tree as input data for the sentence combination module 140, which generates and outputs the summary sentences.
The generation module 120 is operatively coupled to a corpus of human-written summaries 165, a lexical database 170, and a combined reusable lexicon 175. The corpus 165 generally includes a broad collection of human-generated summaries as well as the corresponding original documents. The corpus 165 can also include a sentence reduction subcorpus 165a and a sentence combination subcorpus 165b which can be generated manually or through a decomposition module. The sentence reduction subcorpus 165a includes entries of sentence pairs linking an original sentence to a human reduced sentence. The sentence combination subcorpus 165b includes mappings from human combined sentences to two or more original sentences.
A suitable exemplary corpus 165 was generated using Communications-related Headlines, a free daily online news service provided by the Benton Foundation (http://www.benton.org). The articles from this particular service are communication related, but the topics involved are very broad, including law, company mergers, new technologies, labor issues and so on. Of course, other sources of document summaries can also be used to generate a suitable corpus. To ensure that the resulting corpus is somewhat generic, the articles from the selected source should not possess a particular writing style. Thus, preferred sources feature articles from multiple sources or articles from various sections of one or more sources. A suitable corpus 165 was generated in four major steps. First, human-written, single document summaries are received from the source. Second, the original documents are retrieved and correlated to the respective summary. The retrieved documents are then "cleaned" by removing irrelevant material such as indexes and advertisements. Finally, the quality of the correspondence between the summary and the original document is verified. The cleaning and verification processes are generally performed manually. The sentence reduction subcorpus 165a and sentence combination subcorpus 165b entries were generated by the decomposition module 185, the operation of which is explained below.

The lexical database 170 can take the form of the WordNet database, which is described in the article "WordNet: A Lexical Database for English," by G.A. Miller, Communications of the ACM, Vol. 38, No. 11, pp. 39-41, November 1995.

A suitable embodiment of the combined lexicon 175 can be constructed by combining multiple, large-scale resources such as WordNet, the English Verb Classes and Alternations (EVCA) database, the COMLEX syntax dictionary and the Brown Corpus tagged with WordNet senses. The combined lexicon 175 can be formed by encoding the EVCA database with COMLEX compatible syntax and merging the EVCA into the COMLEX database. This results in each verb in the combined lexicon 175 being marked with a list of subcategorizations and alternate syntactic patterns. Preferably, WordNet is added to the EVCA/COMLEX combination to refine the syntactic information and provide additional lexical information to the lexicon 175.
The generation module 120 is also cooperatively coupled to natural language processing (NLP) tools such as a syntactic parser 180 and a co-reference resolving module 190 which can include anaphora resolution. These tools can be software modules which are called by the generation module 120. A suitable syntactic parser 180 is the English Slot Grammar (ESG) parser available from International Business Machines, Inc. A suitable co-reference resolving module 190 is the Deep Read system, available from Mitre, Inc.
Figure 2 is a flow diagram further illustrating the operation of the sentence reduction module 135. The reduction module 135 receives extracted sentences 115 as input (step 205). The reduction module invokes the parser 180 to grammatically parse the extracted sentences 115 and generate a parse tree representation of the sentences (step 210). In step 215 contextual importance is determined by detecting lexical links among words in a local context and then computing an importance score based on the number, type and direction of lexical links detected. The context processing step 215 generates an importance score for each node in the parse tree indicating the relative importance of the nodes to the focus of the input document 105.
The number, type and direction (forward, backward) of lexical links used in the practice of the present invention may vary. An empirical study has demonstrated that the following nine lexical relation types provide a meaningful representation of contextual importance: (1) repetition, (2) inflectional variants, (3) derivational variants, (4) synonyms, (5) hypernyms, (6) antonyms, (7) part-of, (8) entailment (for example: kill → die), and (9) causative (for example: eat → chew).
Inflectional variants (2) and derivational variants (3) can be derived from the CELEX database content, available from the Centre for Lexical Information, Max Planck Institute for Psycholinguistics, Nijmegen, which can be included in the combined lexicon 175. The other lexical relations can be extracted using the separate lexical database 170, such as WordNet. To frame the local context of a word, a number of sentences before and after the current sentence location are evaluated for the presence of lexical links. The number of sentences selected for this operation involves balancing the level of contextual depth against the amount of processing overhead. Using the five sentences before and the five sentences after the current sentence has been found to provide reasonable local context without incurring excessive processing overhead.
After the lexical links have been identified (step 215a), an importance score for each word in the extracted sentences can be calculated (step 215b). Lexical links from the current sentence to subsequent sentences are referred to as forward links and those from the current sentence to preceding sentences are referred to as backward links. The importance score, referred to as the context weight, can be computed as follows:
1) ForwardWeight(w) = Σ (i = 1 to 9) Wi × Li(w)

2) BackwardWeight(w) = Σ (i = 1 to 9) Wi × Bi(w)

3) TotalWeight(w) = ForwardWeight(w) + BackwardWeight(w)

4) Ratio(w) = max(ForwardWeight(w), BackwardWeight(w)) / TotalWeight(w)

5) ContextWeight(w) = Ratio(w) × TotalWeight(w)
where ForwardWeight(w) computes the weight of forward links, BackwardWeight(w) computes the weight of backward links, TotalWeight(w) represents the sum of all links and Ratio(w) computes a weight for the location of the word. To compute the weight of various lexical links, each type of link is assigned a weighted value according to its relative importance. For example, the nine lexical relations set forth
above were presented in descending order of importance and accordingly can be assigned linearly decreasing weights such as (1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2). The value of Ratio(w) represents the value assigned based on the location of the word in the original document. For example, when a sentence introduces or ends a topic, it is considered more important and the components of those sentences will be assigned a relatively higher location value.
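A minimal sketch of the context weight computation described above, assuming the per-type forward and backward link counts for a word have already been gathered; the weight vector follows the linearly decreasing values given in the text, and all names are illustrative.

```python
# Per-type link weights for the nine lexical relation types, in the
# descending order of importance given in the text.
LINK_WEIGHTS = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

def context_weight(forward_links, backward_links):
    """forward_links / backward_links: length-9 lists counting this word's
    links of each relation type to subsequent / preceding sentences."""
    fw = sum(w * n for w, n in zip(LINK_WEIGHTS, forward_links))
    bw = sum(w * n for w, n in zip(LINK_WEIGHTS, backward_links))
    total = fw + bw
    if total == 0:
        return 0.0
    ratio = max(fw, bw) / total
    # Note: ratio * total algebraically reduces to max(fw, bw),
    # i.e. the dominant link direction determines the context weight.
    return ratio * total
```

Words whose links concentrate in one direction (for example, a word introducing a topic, with mostly forward links) thus score as highly as the stronger of their two directional weights.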
The use of various types of lexical relations improves the measurement of a word's relatedness to the main topic. Although simple relations like repetition and synonymy can be used to determine a measure of contextual importance, such surface relations are generally unable to detect more subtle connections between words.
Following context processing (step 215) the reduction module 135 can perform interdependency processing using a probability analysis based on the corpus 165 of human-written reduction based sentences. Such an analysis can indicate the degree of correlation between components in a sentence, such as the relationship between a verb and its subclause.
The probability computation can be performed based on parse trees, using probabilities to indicate the degree of correlation between a parent node and its child nodes in the parse tree. Figure 3 illustrates an exemplary fragment of a parse tree used to explain the operation of the probability computation. In Figure 3, the main verb "give" is the parent node 300, and it has four child nodes: subclause conjunct 305, subject 310, indirect object 315 and object 320. The parse tree can also include further levels below the child nodes, such as nodes ndet 325 and adjp 330 below child node obj 320 and nodes lconj 335 and rconj 340 below node adjp 330.
To measure the interdependency between the verb "give" and its subclause 305, the probability that the subclause is removed when the verb is "give" can be represented by PROB("when_clause is removed" | verb = "give"). Using Bayes's rule, this conditional probability is transformed to:

PROB("when_clause is removed" | verb = "give") = PROB(verb = "give" | "when_clause is removed") × PROB("when_clause is removed") / PROB(verb = "give")

In a similar fashion, the probabilities that a clause will be reduced or remain unchanged can be calculated.
The probabilities associated with the other child nodes of the current root node are calculated in a similar manner. After the probabilities for each of the first level child nodes are calculated, each of the child nodes in the current level of the tree is then treated as a parent node and the process is repeated through each descending level of the parse tree until every parent-child node pair has been considered. The probabilities for the corpus 165 can be calculated and stored in a look-up table which is used when the reduction module 135 is run.
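The look-up table of corpus probabilities might be built along these lines, assuming the aligned original/reduced parse trees have already been flattened into (head word, child role, removed) observations; the data shapes and names are hypothetical, for illustration only.

```python
from collections import defaultdict

def build_removal_table(corpus_pairs):
    """corpus_pairs: iterable of (head, child_role, removed) observations
    collected from aligned original/human-reduced parse trees.
    Returns a table mapping (head, child_role) to the estimated probability
    that the child component is removed."""
    counts = defaultdict(lambda: [0, 0])  # key -> [times removed, total seen]
    for head, role, removed in corpus_pairs:
        entry = counts[(head, role)]
        entry[0] += int(removed)
        entry[1] += 1
    return {key: removed / total for key, (removed, total) in counts.items()}
```

Precomputing the table once, as the text suggests, turns the per-sentence probability check in the reduction module into a constant-time lookup.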
The context processing of step 215 and probability processing of step 220 provide a relative ranking of sentence components. However, this ranking does not necessarily provide a measure of which components must be included to provide a grammatically correct summary sentence. Thus, preferably, after the probability analysis of step 220, reduction processing based on linguistic knowledge is performed (step 225). In this operation, the reduction module 135 works in cooperation with the combined lexicon 175.
The linguistic knowledge processing step 225 operates with the combined lexicon 175 to evaluate the parse tree for each extracted sentence 115 and determine which child nodes are essential to maintain the grammatical correctness of the component represented by the parent node. Linguistic judgments are identified in the parse tree by assigning a binary tag to each node in the parse tree. The value of a tag is either essential or reducible, indicating whether or not a node is indispensable to its parent node. For example, referring to Figure 3, the lexicon 175 will indicate that the verb "give" needs a subject and two objects. Thus the child nodes subj 310, iobj 315 and obj 320 can be marked as essential. In this case, the child node subclause 305 is then rendered non-essential and will be marked as reducible. The lexicon 175 can also include collocations, such as "consist of" or "replace ... with ...", which prevent removal of indispensable components.
Once the linguistic knowledge processing is applied in step 225, a reduction operation (step 230) can take place. The reduction operation process can be viewed as a series of decision making steps along the edges of a parse tree. Beginning with the root node of the parse tree, the immediate child nodes are evaluated to determine which child nodes can be removed. A child node can be removed if three conditions are satisfied. The first condition is that the component is not a local focus. To determine whether a component is a local focus, the ratio of the context importance score (step 215b) of the child node to that of the root node is calculated. The child node is then considered unimportant if the calculated ratio is smaller than a threshold value. The second condition is that the corpus probability value (step 220) indicating that the particular syntactic component of the root is removed is higher than a threshold. The final condition is that the linguistic analysis in step 225 indicates that the child node is reducible.
When the conditions to remove a child node are satisfied, the child node is tagged as "removable" and processing on that branch of the tree terminates. For the child nodes which are retained, the lower levels of the parse tree are evaluated by repeating this process in a similar manner through the tree. The reduction operation step 230 is complete when there are no more nodes to consider. This also concludes processing of the sentence reduction module and results in the parse trees being marked with those components which can be removed or altered by the subsequent paste module 150 operation.
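The three-condition removal test above could be sketched as follows; the node fields and threshold values are assumptions for illustration, not taken from the patent.

```python
def removable(node, root, removal_prob, focus_threshold=0.5, prob_threshold=0.5):
    """Decide whether a child node can be removed.
    node / root: dicts with a 'context_score' (step 215b) and, for the child,
    a 'reducible' flag from the linguistic analysis (step 225).
    removal_prob: corpus probability (step 220) that this component is removed."""
    # Condition 1: the component is not a local focus.
    not_focus = (node["context_score"] / root["context_score"]) < focus_threshold
    # Condition 2: the corpus says this component is usually removed.
    corpus_says_remove = removal_prob > prob_threshold
    # Condition 3: linguistic analysis marked the node reducible.
    return not_focus and corpus_says_remove and node["reducible"]
```

When this returns true, the traversal tags the node "removable" and prunes that branch; otherwise it recurses into the node's children, matching the top-down walk described in the text.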
Following processing by the sentence reduction module 135, processing by the sentence combination module 140 is performed. The operation of the sentence combination module 140 is further illustrated in the flow chart of Figure 4.
Using the sentence combination subcorpus 165b, the sentence combination module evaluates the extracted sentence to identify applicable sentence combination operations (step 410). Figure 5 is a table illustrating combination operations such as: add descriptions 510, aggregations 515, substitute incoherent phrases 520, substitute phrases with more general or more specific information 525 and mixed operations 530.
From the sentence combination subcorpus 165b, sentence combination rules are also established to determine whether and how the sentence combination operations of step 410 will take place (step 415). The result is a set of sentence
combination rules 420, such as those set forth in Figure 6. The rules illustrated in Figure 6 are exemplary and non-exhaustive. These sentence combination rules 420 were determined empirically by manual inspection of the sentence combination subcorpus 165b. Using the input article 105 and the extracted sentences reduced by the sentence reduction module 135, the sentence combination module 140, in cooperation with the co-reference resolution module 190, applies the sentence combination rules 420 (step 425). The result of step 425 is that the parse trees of the sentences being combined are appropriately tagged to effect the sentence combination. The combination operation is then realized in step 430 using a tree adjoining grammar (TAG) formalism, as described by A. Joshi, "Introduction to Tree-Adjoining Grammars," in Mathematics of Language, John Benjamins, Amsterdam, 1987. In this way, the sentence combination module 140 performs a paste operation on the marked parse trees and generates a summary sentence.
The document summary is generated by combining the summary sentences. The most straightforward combination is to maintain the order of sentences as they were extracted; however, other sequencing arrangements can also be employed.
As noted above in connection with Figure 1, the corpus decomposition module 185 operates on the corpus 165 to generate the sentence reduction subcorpus 165a and the sentence combination subcorpus 165b. The decomposition module 185 generally operates to evaluate the human written summaries in the corpus 165, compare the summary sentences to the original document, determine if a summary sentence was generated by a cut and paste operation and identify where the components of the summary sentences were taken from in the original documents. The operation of the decomposition module 185 is illustrated in the flow diagram of Figure 7.
Referring to Figure 7, the decomposition module 185 uses the human-generated summary and original document as inputs to an indexing operation (step 705). During indexing, each word in the original document is indexed according to its positions in the original document. A convenient way of referencing these occurrences is by sentence number and word number in the original document.
To evaluate the index of words, a set of heuristic rules is developed by manual inspection of the corpus 165. Such inspection reveals that human-generated summaries often include one or more of six operations: sentence reduction, sentence combination, syntactic transformation, lexical paraphrasing, generalization/specification, and content reordering. The heuristic rules can be represented using a bigram probability PROB(W2 = (S2, w2) | W1 = (S1, w1)) (abbreviated as PROB(W2 | W1) in the following discussion). The probability values can be assigned in the following manner:
• IF ((S1 = S2) and (W2 = W1 + 1)) (i.e., the words are in two adjacent positions in the document), THEN PROB(W2 | W1) is assigned the maximal value, P1. (Rule: Two adjacent words in the summary are most likely to come from two adjacent words in the document.)
• IF ((S1 = S2) and (W2 > W1)), THEN PROB(W2 | W1) is assigned the second highest value, P2. (Rule: Adjacent words in the summary are likely to come from the same sentence in the document, retaining their relative order, as in the case of sentence reduction.)
• IF ((S1 = S2) and (W2 < W1)), THEN PROB(W2 | W1) is assigned the third highest value, P3. (Rule: Adjacent words in the summary are likely to come from the same sentence in the document but reverse their relative orders, such as in the case of sentence reduction with syntactic transformations.)
• IF (S1 < S2 ≤ S1 + CONST), THEN PROB(W2 | W1) is assigned the fourth highest value, P4. (Rule: Adjacent words in the summary may come from nearby subsequent sentences in the document, as in the case of sentence combination.)
• IF (S1 − CONST ≤ S2 < S1), THEN PROB(W2 | W1) is assigned the fifth highest value, P5. (Rule: Adjacent words in the summary may come from nearby preceding sentences in the document.)
• IF (|S2 − S1| > CONST), THEN PROB(W2 | W1) is assigned a small value, P6. (Rule: Adjacent words in the summary are not very likely to come from sentences far apart.)

Based on the above heuristic principles, a Hidden Markov Model can be generated, such as is illustrated in Figure 8 (step 710). The nodes in the Hidden Markov Model represent possible positions in the document, and the edges carry the probability of going from one node to another. This Hidden Markov Model is used to find the most likely position sequence in a subsequent processing operation. Values are assigned to P1-P6 empirically. For example, the maximal value can be assigned 1 and the others assigned evenly decreasing values, 0.9, 0.8 and so on. The order of the above rules is based on empirical observations of a particular set of summaries. These values, however, can be adjusted or even trained for different corpora.
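A hedged sketch of the heuristic transition probabilities, assuming positions are (sentence, word) pairs and using illustrative values for P1-P6 and CONST; the exact rule set and values would be tuned to the corpus at hand.

```python
# Illustrative probability values, in the descending order the text suggests.
P = {1: 1.0, 2: 0.9, 3: 0.8, 4: 0.7, 5: 0.6, 6: 0.1}
CONST = 3  # how many sentences apart still counts as "nearby" (assumed)

def transition_prob(pos1, pos2):
    """Return PROB(pos2 | pos1) for two document positions (sentence, word)."""
    s1, w1 = pos1
    s2, w2 = pos2
    if s1 == s2 and w2 == w1 + 1:
        return P[1]              # adjacent words in the same sentence
    if s1 == s2 and w2 > w1:
        return P[2]              # same sentence, relative order kept
    if s1 == s2 and w2 < w1:
        return P[3]              # same sentence, relative order reversed
    if s1 < s2 <= s1 + CONST:
        return P[4]              # nearby subsequent sentence
    if s1 - CONST <= s2 < s1:
        return P[5]              # nearby preceding sentence
    return P[6]                  # sentences far apart
```

Each edge of the Hidden Markov Model in Figure 8 would be weighted by such a function.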
A Viterbi algorithm can be used to evaluate the Hidden Markov Model and find the most likely sequence of words incrementally (step 715). The Viterbi algorithm first finds the most likely sequence for (Word1, Word2), for each possible position of Word2. This information is then used to compute the most likely sequence for (Word1, Word2, Word3), for each possible position of Word3. The process repeats until all the words in the sequence have been considered.
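A minimal Viterbi sketch over this position-sequence model, assuming each summary word has a list of candidate document positions and a caller-supplied transition probability function; the names are illustrative, not from the patent.

```python
def viterbi(candidates, transition_prob):
    """candidates: list (one entry per summary word) of possible document
    positions for that word. transition_prob(prev, pos) scores a move.
    Returns the highest-probability position sequence."""
    # best[pos] = (probability, best path ending at pos)
    best = {pos: (1.0, [pos]) for pos in candidates[0]}
    for options in candidates[1:]:
        new_best = {}
        for pos in options:
            # Extend the best predecessor path to this position.
            prob, path = max(
                ((p * transition_prob(prev, pos), prev_path)
                 for prev, (p, prev_path) in best.items()),
                key=lambda item: item[0])
            new_best[pos] = (prob, path + [pos])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]
```

Because only the best path per position is kept at each step, the search stays linear in the number of summary words, as the incremental description above implies.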
After evaluation by the Viterbi algorithm, post-editing operations can be used to cancel mismatches that occur in the corpus analysis. The result is that summary sentences are matched to the corresponding phrases in the document. Once the summary sentences are so matched, it is a simple endeavor to sort the various matchings into either the sentence reduction subcorpus 165a or the sentence combination subcorpus 165b. In addition, the decomposition module 185 can be used as a stand-alone tool, apart from the rest of the present summary generation system, to perform various summary analysis operations.
Although the present invention has been described in connection with specific exemplary embodiments, it should be understood that various changes, substitutions and alterations can be made to the disclosed embodiments without departing from the spirit and scope of the invention as set forth in the appended claims.




We claim:
1. A system for generating a summary of an input document, said system comprising:
a) an input unit for input of documents, said input unit contained in a plurality of
client devices;
b) a plurality of servers connected to said plurality of client devices and storing the
input documents and a combined lexicon;
c) a central processing unit contained in said plurality of servers and
communicatively connected to the input unit, said central processing unit
having an extraction unit, a summary sentence generation unit and a
grammatical parser unit;
d) said extraction unit connected to the input unit and receiving the input
documents and extracting at least one sentence related to a focus of the
document;
e) said summary sentence generation unit operatively coupled to said extraction
unit;
f) said grammatical parser operatively coupled to the generation unit for parsing
the extracted sentences into components in a grammatical representation; and
g) said combined lexicon operatively coupled to the generation unit and a corpus
of human generated summaries operatively coupled to the generation
unit;
h) a communication network connected to the said plurality of servers and the said
plurality of client devices; and
i) a user interface means located in the said client device providing a user interface
by connecting the said client device with the communications network and a
connection means for verifying the integrity of the client device connection with
the communication network.
2. The system as claimed in claim 1, wherein the generation unit consists of a sentence
reduction unit.
3. The system as claimed in claim 2, wherein the said sentence reduction unit is
cooperatively engaged with the corpus and performs probabilistic importance processing on the components of the grammatical representation in accordance with the corpus.
4. The system as claimed in claim 3, wherein the sentence reduction unit is cooperatively
engaged with the combined lexicon and performs context importance processing on the
components of the grammatical representation and computes the relative importance of
the components based on linguistic knowledge stored in the combined lexicon.
5. The system as claimed in claim 4, wherein the context importance processing consists
of establishing a plurality of lexical links of at least one type for the components and
generating a context importance score based on the type and number of links associated
with the components.
6. The system as claimed in claim 1, wherein the generation unit consists of a sentence
combination unit.
7. The system as claimed in claim 6, wherein the sentence combination unit is operatively
coupled to the corpus, said sentence combination unit identifies at least one sentence
combination operation; establishes at least one rule for applying the sentence
combination operation; and applies the at least one rule to combine at least two
extracted sentences.
8. The system as claimed in claim 7, wherein the at least one sentence combination
operation is selected from the group consisting of add descriptions, aggregations,
substitute incoherent phrases, substitute phrases with more general or more specific
information, and mixed operations.
9. The system as claimed in claim 8, wherein the at least one rule to combine extracted
sentences comprises replacing a partial name phrase with a full name phrase.
10. The system as claimed in claim 9, wherein the at least one rule to combine extracted
sentences comprises determining if two sentences having a common subject are
proximate and whether at least one sentence is marked for reduction then removing the
subject of the second sentence and combining with the first sentence using the
connective "and".
11. The system as claimed in claim 1, wherein the generation unit comprises a sentence
reduction unit and a sentence combination unit.
12. The system as claimed in claim 11, wherein the sentence reduction unit is cooperatively
engaged with the combined lexicon and performs context importance processing on the
components of the grammatical representation and computes the relative importance of
the components based on linguistic knowledge stored in the combined lexicon.
13. The system as claimed in claim 12, wherein the context importance processing
comprises establishing a plurality of lexical links of a least one type for the components
and generating a context importance score based on the type and number of links
associated with the components.
14. The system as claimed in claim 13, wherein the sentence reduction unit is cooperatively
engaged with the corpus and performs probabilistic importance processing on the
components of the grammatical representation in accordance with the corpus.
15. The system as claimed in claim 11, wherein the sentence combination unit is operatively
coupled to the corpus and wherein the sentence combination unit identifies at least one
sentence combination operation; establishes at least one rule for applying the sentence
combination operation and applies the at least one rule to combine at least two
extracted sentences.
16. The system as claimed in claim 15, wherein the at least one sentence combination
operation is selected from the group consisting of add descriptions, aggregations,
substitute incoherent phrases, substitute phrases with more general or more specific
information, and mixed operations.
17. The system as claimed in claim 16, wherein the at least one rule to combine extracted
sentences comprises replacing a partial name phrase with a full name phrase.
18. The system as claimed in claim 17, wherein the at least one rule to combine extracted
sentences comprises determining if two sentences having a common subject are
proximate and whether at least one sentence is marked for reduction then removing the
subject of the second sentence and combining with the first sentence using the
connective "and."
19. The system as claimed in claim 1, wherein the corpus is operatively coupled to a decomposition
unit, the decomposition unit analyzing the corpus and generating a sentence reduction
subcorpus and a sentence combination subcorpus.
20. A system for generating a summary of an input document as herein substantially described with respect to the accompanying figures.



Patent Number 220516
Indian Patent Application Number IN/PCT/2001/00737/DEL
PG Journal Number 30/2008
Publication Date 25-Jul-2008
Grant Date 29-May-2008
Date of Filing 20-Aug-2001
Name of Patentee THE TRUSTEES OF COLUMBIA UNIVERSITY IN THE CITY OF NEW YORK
Applicant Address
Inventors:
# Inventor's Name Inventor's Address
1 MCKEOWN, KATHLEEN, R.
2 JING, HONGYAN
PCT International Classification Number G06F 17/10
PCT International Application Number PCT/US00/04118
PCT International Filing date 2000-02-22
PCT Conventions:
# PCT Application Number Date of Convention Priority Country
1 60/120,657 1999-02-19 U.S.A.