4. Methodology
The methodology uses semantic, linguistic, and factual information from ResearchCyc and WordNet to process web queries. The two major aspects of the methodology are query expansion and query refinement. During query expansion, the query is “expanded” with new terms to improve the retrieval performance. The expansion is usually performed using synonyms of the initial query terms Qw. Query expansion also takes into account the actions or properties of the initial terms (non-taxonomic relationships) and instances (in some cases). For example, the query “Pet” may be expanded as Pet and Animal.
Query refinement is the incremental process of transforming a query into a new query that more accurately reflects the user’s information need [17]. The goal is not to obtain better results but to change (shrink or grow) what the user is looking for, that is, the expected result. To do so, the user is asked to disambiguate the query, after which the query may be reformulated automatically using the semantic knowledge of the ontology. This process uses generalization and specialization relationship types as well as non-taxonomic relationships and instances. For example, the query “Pet” may be refined by asking the user:
You are interested in Pets, but are you interested in any activity related with Pets? (1) buying/selling Pets, (2) Pet Stores in your area, (3) Providers of Animal Therapy
Supposing that the user is interested in Pet Stores in his/her area and he/she lives in Atlanta, then using the ResearchCyc knowledge the query may be automatically redefined as: “Pet Store” and Atlanta and Georgia and buy.
The proposed methodology consists of four phases: (a) Query Parsing, (b) Query Expansion, (c) Query Refinement, and (d) Query Submission. The query parsing phase parses the natural language query using POS tagging and identifies the types of terms: nouns (and noun phrases), verbs, adjectives, adverbs, etc. These terms form the initial query. The query expansion phase adds similar terms and, where appropriate, negative knowledge to the query. The user identifies the correct word sense, and the other word senses are added as negative knowledge since the user is not interested in them. The query refinement phase reformulates the query to focus more closely on the needs of the user. This is accomplished by using the taxonomic and non-taxonomic relationships in ResearchCyc. The query submission phase creates the final query according to the syntax required by the search engine, submits it, and returns the results to the user. After each step, the user is asked whether the query reflects his or her intention. If so, the final query is constructed using the appropriate syntax and submitted to the search engine. The steps in the methodology are summarized in Table 3.
Table 3. Steps of the methodology
| Phase | Step | Description | Knowledge used | Result |
| Query parsing | 1 | Query is parsed using POS Tagger in order to identify the terms used | None | A set of query terms (t1, …, tn) that will be used as the initial query |
| Query expansion | 2 | The concepts in ResearchCyc that represent the query terms are identified | Linguistic information from ResearchCyc (use of WordNet is also recommended due to limited linguistic information contained in ResearchCyc) | A set of the ResearchCyc concepts (c1, …, cm with m ≥ n) relevant to the query |
| Query expansion | 3 | There may be more than one concept ci for each term tj. This step finds the appropriate word sense of each term in the context of the query. This process is mostly manual, although heuristics are applied to disambiguate terms automatically in some cases | General information about the concepts is used, such as description, subtypes, and supertypes. To disambiguate concepts automatically, disjoint constraints, generalization/specialization relationships, and general relationship types are used | Two sets: (1) one concept per query term (each concept representing the relevant meaning of a term), and (2) the concepts that represent the discarded senses |
| Query expansion | 4 | Extends the query with other concepts closely related to the query concepts | Some generic relationship types of ResearchCyc are used to identify the elements closely related to each concept ci | A set of concepts (cl1, …, clk) closely related to the concepts of the query |
| Query refinement | 5 | Several refinements of the query are identified and presented to the user; user may proceed with the initial query or choose a refinement and generate a refined query | The 82 part of relationship types and other relationship types that denote semantic closeness between concepts are used to identify possible refinements for the query concepts | If the user has selected any refinement, then the output is one ResearchCyc concept for each of the query terms |
| Query submission | 6 | Construct the final Boolean query using appropriate syntax | Linguistic information is used. The denotation words of the selected and discarded concepts are incorporated into the query using the search engine syntax | A string that represents the final query |
| Query submission | 7 | Submit query to the search engine and provide the results back to the user | None | Results of query execution |
The following example shows how concepts in ResearchCyc can be used to reason about query terms, to select an appropriate sense, and to choose terms to add to the query. Suppose the user wants to know where to drink mate, a kind of tea frequently drunk in Argentina, and writes the query “drinking mate in Barcelona”. In the first step, the query is parsed and the output is the set of initial query terms, namely, drinking, mate, and Barcelona. The word drinking has three senses in ResearchCyc: Alcoholic beverage, Drink as a noun, and Drink as a verb (the act of drinking). The word mate has three senses, with Paraguayan tea identified from WordNet because it is not defined in ResearchCyc. We use the links between the supertypes of mate in WordNet and ResearchCyc to identify that tea (a supertype of mate) is related to the concept Tea-Beverage in ResearchCyc. Finally, Barcelona has only one sense: city of Barcelona. Thus, the result of the second step for the presented query is:
{{AlcoholicBeverage, Drink, DrinkEvent},{partner, Tea-Beverage},{CityOfBarcelona}}.
The above result contains three senses for the word drinking. In the third step of the methodology, the second sense, Drink, is automatically discarded because it is the supertype of the sense AlcoholicBeverage of the same word. User interaction is then needed to establish that we are interested in the activity of drinking rather than in alcoholic beverages. Therefore, the concept DrinkEvent is selected as the appropriate sense for the first query term. Note that the appropriate sense of the second word, “mate”, may be inferred automatically because Tea-Beverage is related to a particular sense of each of the other two words. Specifically, Tea-Beverage is related to DrinkEvent because Tea-Beverage is a subtype of Drink, and Drink is related to DrinkEvent by a relationship denoting that the action of drinking involves consuming a drink. Tea-Beverage is related to CityOfBarcelona because CityOfBarcelona is an instance of City, which is a subtype of Place; Place is related to Event-Localized, which is a supertype of DrinkEvent, by the relationship type EventOccurs. Hence, the other senses of the second word are automatically discarded and added as negative knowledge. Therefore, the third step returns the following two lists:
{DrinkEvent, Tea-Beverage, CityOfBarcelona}
{{AlcoholicBeverage}, {partner}, { }}
The first list denotes the relevant meaning for the query while the second list represents the discarded senses. In the fourth step of our example, the query should be expanded with the concept Spain because there is a relationship called CountryOfCity that relates CityOfBarcelona with the concept Spain. Hence, the concept Spain has been added to the list of relevant concepts. Assume that the user does not select any refinement in this example (fifth step). Finally, the sixth step returns the following query:
“Drinking mate tea Barcelona Spain – alcoholic – love”
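The supertype heuristic applied in the third step of this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary `TOY_SUPERTYPES` is a hand-written stand-in for the ResearchCyc taxonomy (it is not obtained through the ResearchCyc API), and the function names `ancestors` and `discard_supertype_senses` are hypothetical. The sketch drops any candidate sense that is a direct or indirect supertype of another candidate sense of the same word.

```python
# Toy taxonomy (assumption): maps a concept to its direct supertypes.
# Only the two subtype facts stated in the example are encoded here.
TOY_SUPERTYPES = {
    "AlcoholicBeverage": ["Drink"],
    "Tea-Beverage": ["Drink"],
}


def ancestors(concept, supertypes):
    """Return every direct or indirect supertype of a concept."""
    seen = set()
    stack = list(supertypes.get(concept, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(supertypes.get(parent, ()))
    return seen


def discard_supertype_senses(senses, supertypes):
    """Keep only the senses that are not a supertype of another sense of the same word."""
    kept = []
    for sense in senses:
        covers_other_sense = any(
            sense in ancestors(other, supertypes)
            for other in senses
            if other != sense
        )
        if not covers_other_sense:
            kept.append(sense)
    return kept


# Candidate senses of "drinking" from the second step of the example.
remaining = discard_supertype_senses(
    ["AlcoholicBeverage", "Drink", "DrinkEvent"], TOY_SUPERTYPES
)
print(remaining)
```

With these inputs, Drink is removed and the two remaining senses, AlcoholicBeverage and DrinkEvent, are the ones between which the user has to choose.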
5. Prototype architecture and implementation
The methodology has been implemented in a prototype using J2EE technologies. The prototype interfaces with Google and AlltheWeb search engines. The query expansion module and the query refinement module interact with ResearchCyc through its Java API [5], which is used for querying the concepts of ResearchCyc and making inferences about the concepts related to user query terms.
The architecture of the prototype is shown in Fig. 2 and consists of two parts: the client side and the Java-enabled server side. The client is a web browser that displays the web pages created on the server side to gather information from the user and present the query results. The server side contains four major components: (1) query parser module, (2) query expansion module, (3) query refinement module, and (4) query generation module.
Fig. 2. System architecture.
The Query Parser Module captures the user’s query, parses it with the QTag parser (http://www.english.bham.ac.uk/staff/omason/software/qtag.html) and returns the part-of-speech for each term. From this, a baseline query is created.
The Query Expansion Module interfaces with ResearchCyc knowledge sources and supports the query expansion steps. For each query term, it obtains the related concepts and the synsets and lets the user select the appropriate word sense to use. Based on the user’s input, appropriate synonyms and negative knowledge are added to the query. In some cases, the word sense can be identified automatically when some of the terms in the query are relationship types that relate to other concepts in the query for only one of the possible senses.
The Query Refinement Module interfaces with ResearchCyc and adds personal information that is relevant to the query to restrict the search domains. Based on the user’s selected synset, hypernyms and hyponyms for the selected sense of the term are obtained from ResearchCyc and WordNet. This module uses taxonomic and non-taxonomic relationships from ResearchCyc to reason about concepts and to propose appropriate refinements to the query. When the user chooses to refine a query, the new query is sent back to the query expansion module because the refined query can be expanded with new information and further refined before being executed.
The Query Generation Module creates the augmented query using the appropriate syntax for the search engine. Boolean operators are used to construct the final query, and care is taken to ensure that the final query meets the syntax requirements. The Search Engine Interface enables the final query to be submitted to various search engines and forwards the results back to the user.
The prototype is implemented as a web application using JSP (JavaServer Pages). This development environment was chosen because it makes the system portable and easily accessible through the World Wide Web. On the client side, web pages are used to gather information from the user, such as the initial query, the user’s selection of the relevant senses for the query terms, and the refinements to apply. On the server side, several modules have been implemented using Java servlets. These modules parse the query, identify the correct senses of the query terms in the ontology, identify other concepts from the ontology that are also relevant to the query, and identify possible query refinements. The query expansion module and the query refinement module interact with ResearchCyc through its Java API [5]. The query generation module interfaces with the Google and AlltheWeb search engines.
This section illustrates how our system works using a sample query. Assume that a user from Atlanta (Georgia) wants to buy forks and therefore poses the query “buying fork Georgia”. If we execute this query in Google, only one of the first 10 results is relevant to the user.
In our system, the user would type the query in the initial web page (Fig. 3) and click on the “Parse Query” button. The query is sent to the server and parsed to identify the query terms (the nouns and verbs contained in the query among others). The query terms in the example are {buying, fork, georgia}.
Fig. 3. Initial web page for specifying the query.
The query expansion module identifies the ResearchCyc concepts that are linguistically related to the query terms. If more than one concept is related to a single query term then the system creates a web page and sends it to the user (Fig. 4). This web page shows the different meanings of the ambiguous query terms and allows the user to choose the correct sense. In our example, the term Georgia has three different meanings in the ontology: the University of Georgia, the state of Georgia in the US and the country of Georgia in Europe. Then, the user selects the meaning Georgia-State and clicks on the “Query Expansion and Refinement” button to continue. Due to the incompleteness of the ontology, an option called “none of the previous senses” has been added to the web page. In the event that none of the senses defined in the ontology fits with the query term in the context of the query, the user can select this option.
Fig. 4. Disambiguation web page.
Based on the user’s selection, the query expansion module identifies the concepts that can be used to expand the query. At this point, only geographical information and the part of relationship types are used to identify these expansions. Since Georgia-State is a state of the US, the query is expanded with the term the United States. The query generation module creates the final query using the denotations of the correct meaning of the query terms and their expansions. The discarded meanings are added as negative information to the query. The resultant query is
fork georgia buying – “the university of georgia” – “the republic of georgia” the United States
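As a rough illustration of how such a query string could be assembled, the following Python sketch joins the selected denotations, the discarded senses, and the expansions in the same order as the query above. The function names `quote_if_phrase` and `build_query` are hypothetical, and the exclusion syntax (a hyphen immediately before the term, with multi-word senses wrapped in double quotes) is an assumption about the target search engine rather than a description of the prototype’s query generation module.

```python
def quote_if_phrase(text):
    """Wrap multi-word text in double quotes so it is treated as one phrase."""
    return f'"{text}"' if " " in text else text


def build_query(terms, discarded_senses, expansions):
    """Assemble a query string from kept terms, negative knowledge, and expansions.

    Assumed syntax: a leading hyphen excludes a term or quoted phrase.
    """
    parts = list(terms)
    parts += ["-" + quote_if_phrase(sense) for sense in discarded_senses]
    parts += list(expansions)  # expansions are appended unquoted, as in the example
    return " ".join(parts)


final_query = build_query(
    terms=["fork", "georgia", "buying"],
    discarded_senses=["the university of georgia", "the republic of georgia"],
    expansions=["the United States"],
)
print(final_query)
```

A different search engine would only require swapping the exclusion and quoting rules inside these two small functions.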
Next, the system identifies the possible refinements of the query by studying the knowledge related to the relevant concepts. When all the possible refinements have been identified, the system creates a web page that contains the created query and these refinements (Fig. 5). In this example, the system presents seven refinements for the term fork (kitchenware, eating, utensil organizer, hand, grip, control, and action with only one performer), three refinements for Georgia (Governor, US Governor, and state official), and two for buying (drug trafficking and invest in hedge fund).
Fig. 5. Query refinement web page for initial query.
In the web page shown in Fig. 5, the user has two options:
1. Generate the final query: If no refinements have been done, the user can generate the final query by clicking “Construct Final Query”. Then the final query will be created and presented to the user using the syntax of Google and AlltheWeb search engines. The user can click on the corresponding button for the search engine that he or she wants to use. Since the query is presented in a text box, the user can modify it before submission. The results of the query are directly presented by the search engine.
2. Refine the query: The user can check the proposed refinements to see if any of them better represents his or her needs. Suppose that, in our example, the user chooses the kitchenware refinement for the term fork. The query is then refined by substituting the term fork (and its negative and expanded knowledge) with the term kitchenware. Since the user is interested in neither drugs nor politics, he or she does not need to select further refinements for the other two terms, Georgia and buying. When refinements are selected, the “Construct Final Query” button changes to “Refine & Expand Query”. When the user clicks on this button, the query is modified to “buying kitchenware Georgia” and sent to the query expansion module again because the new query term (kitchenware) may involve additional expansions. Since the user may also be interested in the refinements of the new query, the possible refinements are created and presented to the user in a web page.
Assume that, after this refinement, the user agrees with the expanded query and therefore executes it by clicking on the “Construct Final Query” button. Then the final query is created and the user can execute it using the Google or AlltheWeb search engines. Our system greatly improved the relevance of the results for this query: eight of the first 10 results returned for the refined query were relevant, compared to only one for the initial query.
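The interaction between refinement and expansion in this walkthrough amounts to a loop: expand the query, offer refinements, and re-expand whenever the user substitutes a term. The Python sketch below illustrates only that control flow. The function `refine_until_accepted` and the three callables passed to it are hypothetical stand-ins for the expansion module, the refinement module, and the user’s choice on the web page; the toy expansion rule and the scripted choices are assumptions made for the example.

```python
def refine_until_accepted(terms, expand, propose_refinements, choose):
    """Expand the query, offer refinements, and repeat until none is chosen.

    `choose` returns either None (accept the query) or a pair
    (old_term, new_term) describing the selected refinement.
    """
    while True:
        expanded = expand(terms)
        choice = choose(propose_refinements(terms))
        if choice is None:
            return expanded
        old_term, new_term = choice
        # Substitute the refined term, then loop so the new term is expanded too.
        terms = [new_term if term == old_term else term for term in terms]


# Scripted user (assumption): refine "fork" to "kitchenware", then accept.
scripted_choices = iter([("fork", "kitchenware"), None])

final_terms = refine_until_accepted(
    ["buying", "fork", "Georgia"],
    # Toy expansion rule: a query mentioning Georgia gains "the United States".
    expand=lambda ts: ts + ["the United States"] if "Georgia" in ts else ts,
    # Toy refinement proposals for the term "fork" only.
    propose_refinements=lambda ts: {"fork": ["kitchenware"]} if "fork" in ts else {},
    choose=lambda options: next(scripted_choices),
)
print(final_terms)
```

Keeping the expansion inside the loop reflects the point made above that a refined query may itself be expanded and refined further before it is executed.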
Sample queries have been executed with the results shown in Table 4. The base query and the expanded query were executed in Google, and the number of relevant hits in the top 10 results was recorded as the relevance score. These sample queries show that the addition of semantic and linguistic knowledge helps improve query results.
Table 4. Query results
| Base query | Relevance score (Google) | Expanded query | Relevance score (our method) |
| Flute bohemian drink | 1 | (Flute OR champagne flute ) Bohemian (Drink OR beverage) – woodwind – drinking | 4 |
| Blues Suicide | 3 | Suicide (Blues OR depression) – blues music – the blues style of music | 10 |
| Find cookie stores | 3 | Cookie (stores OR retail store ) (Find OR encountering) – http cookie - http cookie – storing – retail space – fund – conscious activity | 10 |
| Pirates punishment | 0 | punishment ( Pirates OR pirate) – pirating – buccaneer – whitworth college | 4 |
| Coach agency rules | 1 | (agency OR organization) Coach ( rules OR code of conduct ) – federal agency – coaching – bus – governing – ruler – principle | 5 |
| Image download sunset | 7 | Wallpaper sunset download – sundown | 10 |
| Monkey virus | 4 | Monkey virus – computer virus | 6 |
| Buccaneer history | 8 | History piracy | 10 |
In the first query, the user searches for information about bohemian flutes, which are glasses made in the Bohemian region (Czech Republic). This query returns a lot of information about flutes in the sense of the tube-shaped musical instrument. In the second query, the user is interested in the relationship between suicide and feelings of sadness. In the third query, the user is interested in information about biscuit stores. Executing this query using a search engine yields irrelevant results because a cookie is also a web technique used for storing information on a user’s computer. The fourth query returns web pages about the sailors who engaged in piracy during the 17th, 18th and 19th centuries. However, the user is interested in the punishments for piracy in the sense of the illegal copying of copyrighted material. The fifth query returns a great deal of irrelevant information regarding vehicles and traveling. In the expanded query, the possible meanings of the term coach are narrowed down and therefore the query results are improved. In the sixth query, the user wants to download images of sunsets in order to use them as wallpaper on his or her computer. Since our tool allows refining the term image to wallpaper, some of the results that dealt with screensavers can be discarded. The seventh query does not produce good results because the phrase monkey virus is ambiguous: there is a computer virus and a biological virus with the same name. In the refined query, the user indicates a preference for the biological virus sense. In the last query, the user is looking for the history of buccaneers. Some of the results obtained deal with the football team Tampa Bay Buccaneers. In all of the above queries, the refinements proposed by our system improve the query results.
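The relevance scores in Table 4 can be summarized with a short calculation. The Python sketch below copies the eight score pairs from the table and treats each score divided by 10 as precision at 10; the names `TABLE_4` and `mean_precision_at_k` are introduced here for illustration and do not come from the prototype.

```python
# (base score, expanded score) pairs copied from Table 4. Each score is the
# number of relevant pages among the first 10 results.
TABLE_4 = {
    "Flute bohemian drink": (1, 4),
    "Blues Suicide": (3, 10),
    "Find cookie stores": (3, 10),
    "Pirates punishment": (0, 4),
    "Coach agency rules": (1, 5),
    "Image download sunset": (7, 10),
    "Monkey virus": (4, 6),
    "Buccaneer history": (8, 10),
}
TOP_K = 10


def mean_precision_at_k(scores, k=TOP_K):
    """Average fraction of relevant results in the top k over all queries."""
    return sum(scores) / (k * len(scores))


base_mean = mean_precision_at_k([base for base, _ in TABLE_4.values()])
expanded_mean = mean_precision_at_k([expanded for _, expanded in TABLE_4.values()])
all_improved = all(expanded > base for base, expanded in TABLE_4.values())

print(f"base: {base_mean:.2f}, expanded: {expanded_mean:.2f}, all improved: {all_improved}")
```

Over these eight queries, the mean precision at 10 rises from roughly 0.34 for the base queries to roughly 0.74 for the expanded queries, and every individual query improves. With only eight hand-picked queries, this is an illustration rather than a statistical evaluation.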
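This article presents a method that uses knowledge from ResearchCyc and WordNet to enhance web queries. The method improves retrieval through query expansion and query refinement so that the results better meet the user’s needs.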