Improving web-query processing through semantic knowledge and user feedback

7. Discussion

The use of ResearchCyc has been successful for improving web querying. However, the quantity of information contained in ResearchCyc and the lack of mechanisms for accessing ResearchCyc in an efficient manner made it difficult to work with. Several difficulties arose when integrating ResearchCyc into the prototype.

• Documentation: There is little documentation that identifies the different classes within the API and explains how to use them. In some cases, the only way to identify relevant functions is to examine the source code of the API. For example, although the API has a function that returns the concepts representing the meanings of a term (getDenotationsOf), it is not described in the API documentation. Therefore, one must inspect the source code of the API (the class CycAccess) to discover its existence.

• Operating Environment: There are problems running ResearchCyc on different operating systems. Specifically, we were unable to run the API under Windows XP or Debian GNU/Linux.

Other problems arose that were based on the type of knowledge and the level of detail of its description. First, there is a lack of domain information. Although ResearchCyc is a huge ontology, there are many domains that it does not cover. For example, in the query “drink mate Barcelona” the user is interested in a tea originally from Paraguay. Unfortunately, ResearchCyc does not contain a concept that describes such a tea and therefore other ontologies must be used to fill this gap.

Second, there is a lack of linguistic information. For example, ResearchCyc does not have well-defined denotations for all of its concepts; only a few synonyms and antonyms are represented. Third, ResearchCyc does not always use the terminology that is most common in the real world. This results in misaligned queries that are difficult to execute. Fourth, ResearchCyc does not distinguish how close different pieces of extensional information are to a given term. When a term has different senses, the user identifies the correct sense to use, and the irrelevant senses are used as negative information in the query. This heuristic may cause problems when applied to extensional knowledge. For example, suppose we are interested in the hobbies of the former US president Bill Clinton and therefore create the query “Clinton hobbies”. There are several instances in ResearchCyc that have Clinton as a surname: William Clinton, Hillary Clinton and Chelsea Clinton. If we choose William Clinton, the remaining instances are used as negative knowledge. This will eliminate web pages that contain his wife's or daughter's name, even though they may contain information about the hobbies of Bill Clinton.
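The side effect of the negative-knowledge heuristic can be illustrated with a minimal sketch. It is not the prototype's code: the function names, the substring-based matching rule and the page snippets are all assumptions made for this illustration.

```python
def negative_senses(senses, chosen):
    """Return every sense the user did not pick; these become negative terms."""
    return [s for s in senses if s != chosen]


def page_matches(text, positive_terms, negative_terms):
    """Keep a page only if it has all positive terms and no negative term."""
    lowered = text.lower()
    has_all = all(t.lower() in lowered for t in positive_terms)
    has_negative = any(t.lower() in lowered for t in negative_terms)
    return has_all and not has_negative


# Instances with the surname Clinton, as listed in the text.
senses = ["William Clinton", "Hillary Clinton", "Chelsea Clinton"]
negatives = negative_senses(senses, chosen="William Clinton")

# Invented page snippets, for illustration only.
pages = {
    "page_a": "William Clinton hobbies and pastimes",
    "page_b": "Hillary Clinton talks about her husband: Clinton hobbies over the years",
}
kept = [name for name, text in pages.items()
        if page_matches(text, ["Clinton", "hobbies"], negatives)]
print(kept)  # ['page_a']
```

In this toy setting, page_b is discarded only because it mentions Hillary Clinton, although it is about the hobbies being searched for.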

Hence, ResearchCyc is most effectively used with other knowledge sources such as WordNet to improve web querying.

8. Related work

Since traditional search engines do not use any domain knowledge, they do not understand the meaning of a user's query or the inherent relationships between its terms [19]. Present search engines are trying to overcome these problems in different ways. One way is to deduce the context of the web query from its query terms. For example, if the query contains a U.S. street address or a city, the search engine can provide direct links to maps and several web pages related to the city. As web search becomes a more important function within society, the need for better search services is growing [20].

Due to the diversity in content and structure of the web, novel techniques are needed to create more focused queries. Some query expansion and refinement techniques use conceptual fuzzy sets [21]. Although several approaches to improve web queries through query expansion are reported in the literature, we discuss only the ones closest to our approach, specifically [11] and [5].

Zhang et al. [11] present an approach that disambiguates user queries by analyzing the “relationship” context associated with query concepts. Their method uses a domain ontology to support the whole activity. The paper presents a set of pattern queries that try to represent the different ways in which the relationships and concepts can be posed in a query. The presented methodology is somewhat similar to ours in that it identifies relevant concepts and relationship types involved in the context of the query. However, this approach has several weaknesses:

(a) The patterns they present may be useful for identifying concepts and relationship types in written text; however, they may not work well for web queries, because such queries tend to be short and their terms may appear in any order.

(b) Their approach is context (or domain) dependent and may not be generalizable. The query context detection is driven by the context of the ontology. In other words, the context of the ontology subsumes the context of the query. This will cause scalability problems because queries may deal with more than one domain.

(c) The manual intervention needed in their methodology is quite high. In large domain ontologies such as UMLS, the number of applicable relationship types between two given terms may be quite high. Hence, manual intervention is needed since the order of the concepts and relationship types cannot be determined automatically.

(d) Due to the lack of good ontologies in most domains, the approach can be successfully applied in only a few domains.

In [5], Burton-Jones et al. present a methodology for context-aware query processing on the Web. Their methodology enhances the semantic content of Web queries using two complementary knowledge sources: lexicons and ontologies. The methodology constructs a semantic net using the original query as a seed, and refines the semantic net with terms from the two knowledge sources. The enhanced query, represented by the refined semantic net, can be executed by search engines. Their empirical evaluation shows that queries suggested by their system produce more relevant results than the original queries. Their work demonstrates the use of existing knowledge sources to enhance the semantic content of Web queries.
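The general idea of seeding a net with the query terms and enriching it from two knowledge sources can be sketched as follows. This is only a toy illustration, not the algorithm of [5]: the dictionaries standing in for the lexicon and the ontology, their entries, and the boolean OR rendering of the final query are all assumptions.

```python
def build_semantic_net(query_terms, lexicon, ontology):
    """Map each seed term to related terms gathered from both sources."""
    net = {}
    for term in query_terms:
        related = []
        for source in (lexicon, ontology):
            for candidate in source.get(term, []):
                if candidate not in related and candidate not in query_terms:
                    related.append(candidate)
        net[term] = related
    return net


def to_query(net):
    """Render the net as a query: each seed term OR-ed with its related terms."""
    groups = []
    for term, related in net.items():
        alternatives = [term] + related
        # Quote multi-word alternatives so they are treated as phrases.
        quoted = [f'"{a}"' if " " in a else a for a in alternatives]
        if len(quoted) > 1:
            groups.append("(" + " OR ".join(quoted) + ")")
        else:
            groups.append(quoted[0])
    return " ".join(groups)


# Hypothetical stand-ins for a lexicon (synonyms) and an ontology (related concepts).
lexicon = {"drink": ["beverage"], "mate": ["yerba mate"]}
ontology = {"mate": ["tea"], "Barcelona": []}

net = build_semantic_net(["drink", "mate", "Barcelona"], lexicon, ontology)
print(to_query(net))  # (drink OR beverage) (mate OR "yerba mate" OR tea) Barcelona
```

The sketch keeps the seed terms in their original order, which matters for the term-order sensitivity of search engines discussed next.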

This approach is very similar to ours; however, it has some limitations. One issue is the way in which the final query is constructed. In some search engines, such as Google, the order of the terms in the web query greatly influences the results. For that reason, their extended query may return more irrelevant web pages, even though it is more focused on the user's intent. Another limitation is the use of the DAML Library, which contains a number of classes dispersed among different ontologies. The number of interconnections between concepts (such as the number of relationship types per concept) is lower than the number of interconnections in our support ontology, namely ResearchCyc. Therefore, less manual intervention is involved in our approach. A further difference is that [5] does not include a refinement activity.

Some query refinement methods have not been successful because they are syntax-based [7]. However, some semantic refinement approaches are beginning to appear. The closest one to our approach is by Hartmann et al. [7], which defines an architecture that realizes semantic-based search to access information represented in semantic portals (called SEAL). Their approach uses query refinement techniques to refine the logical queries provided by the users. They present a set of heuristics that direct the query refinement. The main difference between this approach and ours is that the search is performed on a knowledge base instead of the web. The main drawbacks of applying this method to the web context are: (1) since the ontology used is in OWL-DL, it cannot take advantage of general constraints and heuristics; (2) some of the assumptions made in this work do not hold in the case of web search; (3) the support ontology used cannot have multiple inheritance, which is a serious drawback, since most large and meaningful domain ontologies use multiple inheritance; and (4) the approach is specific to the SEAL technology, and its generalizability to the web context is unclear.

Another approach that uses query expansion and refinement, but with different goals, is [6]. It focuses more on how to create a small lattice of concepts to support query expansion and less on the methodology to use such a lattice. On the other hand, our approach focuses on how to group and use the knowledge contained in large knowledge bases to improve the query meaningfully and efficiently.

9. Conclusion

A methodology for employing ResearchCyc, a body of semantic knowledge about various application domains, has been presented. The methodology is based on prior research on semantic and linguistic knowledge repositories. The methodology has been implemented in a prototype and applied to web queries. The preliminary results from the prototype are encouraging; however, further validation using different types of queries and domains is required to provide more conclusive evidence. Further work is also needed to determine the circumstances under which the approach may not yield good results.

This research contributes to the improvement of web queries in several ways. First, it demonstrates that semantic and linguistic knowledge together improve query expansion. Second, the research identifies and formalizes web-query problems and presents a query classification scheme that explains why, in some cases, query expansion may not be successful, even if the repository used to support such a task is complete. Third, the research illustrates that an ontology structure should contain concepts and generalization relationships to represent both intensional and extensional information. The inclusion of heuristics and integrity constraints facilitates inferencing on new information to create intelligent queries.

Although ResearchCyc contains useful knowledge for supporting web queries, it is most effective when augmented with linguistic information from WordNet and factual information from the World Wide Web [14] and [15]. Future research will concentrate on expanding the heuristics of the methodology and automating different aspects of it (e.g. sense disambiguation). It will also examine the feasibility of integrating ResearchCyc with other ontology libraries and with other search engines.

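7. Discussion

ResearchCyc successfully improved web querying. However, the sheer volume of information it contains, together with the lack of a mechanism for accessing it efficiently, made it hard to work with. Several difficulties arose when ResearchCyc was integrated into the prototype.

• Documentation: Almost no documentation identifies the different classes in the API or explains how to use them. In some cases, the only way to learn about a relevant function is to inspect the source code of the API. For example, although the API has a function that returns the concepts denoted by a term, it is not described in the API documentation, so one must inspect the source code to confirm that it exists.

• Operating environment: Running ResearchCyc on different operating systems is also problematic. In particular, we could not run it under Windows XP or Linux.

Problems also arose from the type of knowledge and the level of detail at which it is described. First, domain information is missing. Although ResearchCyc is a large ontology, many domains are not covered. For example, in the query “drink mate Barcelona” the user is interested in a tea that comes from Paraguay, but ResearchCyc contains no concept describing this tea, so other ontologies are needed to fill the gap.

Second, linguistic information is missing. For example, ResearchCyc does not have well-defined denotations for all of its concepts; it provides only a few synonyms and antonyms. Third, ResearchCyc does not always use the terms that are common in the real world, which produces misaligned queries that are hard to execute. Fourth, ResearchCyc does not distinguish how closely different pieces of extensional information are related to a given term. When a term has several senses, the user selects the correct one, and the irrelevant senses are used as negative information in the query. This heuristic causes problems when applied to extensional knowledge. For example, suppose we want to know the hobbies of the former US president Bill Clinton and create the query “Clinton hobbies”. ResearchCyc contains several instances with the surname Clinton: William Clinton, Hillary Clinton and Chelsea Clinton. If we choose William Clinton, the other instances are used as negative information. As a result, web pages that mention his wife's or daughter's name are lost, even though they may contain information about Bill Clinton's hobbies.

Therefore, ResearchCyc improves web querying most effectively when it is used together with other knowledge sources such as WordNet.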

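8. Related work

Because traditional search engines cannot handle any domain knowledge, they do not understand the meaning of a user's query or the underlying relationships between its terms. Current search engines try to overcome these problems in different ways. One way is to infer the context of the web query from its terms. For example, if the query contains a U.S. street address or a city, the search engine links directly to maps and web pages related to that city. As web search becomes an increasingly important function in society, good search services are becoming increasingly important as well.

Reference [11] proposes disambiguating queries by analyzing the relationships among the concepts in the query. The method uses a domain ontology to support the whole process. The paper presents a set of pattern queries that represent the different ways in which relationships and concepts can be posed in a query. The method is somewhat similar to ours in that it identifies the relevant concepts and relationships in the context of the query. However, it has several weaknesses:

(a) The patterns they present are useful for identifying concepts and relationships in written text, but they may not work well for web queries, because web queries tend to be short and their terms may appear in any order.

(b) Their method is context dependent and may not generalize. Detection of the query context is driven by the context of the ontology. In other words, the ontology context subsumes the query context. This causes scalability problems, because a query may involve more than one domain.

(c) Their method requires a great deal of manual intervention. In large domain ontologies such as UMLS, the number of relationship types applicable between two given terms can be quite high. Because the order of the concepts and relationship types cannot be determined automatically, manual intervention is necessary.

(d) Because good ontologies are lacking in most domains, the method can be applied successfully in only a few domains.

In [5], Burton-Jones et al. propose a method for context-aware query processing on the web. Their method uses two complementary knowledge sources, lexicons and ontologies, to enhance the semantic content of web queries. The method builds a semantic net using the original query as a seed, and then refines the net with terms from the two knowledge sources. The enhanced query, represented by the refined semantic net, can be executed by search engines. Their empirical evaluation shows that the queries suggested by the system produce more relevant results than the original queries. Their work demonstrates the use of existing knowledge sources to enhance the semantic content of web queries.

This method is very similar to the one proposed here, but it has some limitations. One issue is the way the final query is constructed. In some search engines, such as Google, the order of the terms in the query largely determines the results. For this reason, their extended query may return more irrelevant web pages, even though it is more focused on the user's intent. Another limitation is the use of the DAML library, which contains a number of classes dispersed among different ontologies. The number of interconnections between concepts is lower than in our support ontology, ResearchCyc. Therefore, our method involves less manual intervention. A further difference is that their method does not include a refinement step.

Some refinement methods have not been successful because they are syntax-based. However, some semantic refinement methods have begun to appear. The closest to ours is the method in [7], which defines an architecture for semantic-based search over the information in semantic portals. Their method uses query refinement techniques to refine the logical queries provided by users. They present a set of heuristics that direct the query refinement. The main difference between their method and ours is that their search is performed on a knowledge base rather than on the web. The main drawbacks of applying this method in the web context are: (1) because the ontology used is in OWL-DL, it cannot take advantage of general constraints and heuristics; (2) some of the assumptions in their method do not hold for web search; (3) the support ontology cannot have multiple inheritance, which is a serious drawback, since many large and meaningful domain ontologies use multiple inheritance; and (4) the method is specific to the SEAL technology, and its generalizability to the web context is unclear.

Reference [6] uses query expansion and refinement, but with different goals. It focuses more on how to build a small concept lattice to support query expansion and less on the methodology for using such a lattice. Our method, in contrast, focuses on how to group and use the knowledge contained in large knowledge bases to improve queries more effectively.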

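9. Conclusion

A methodology has been presented that uses ResearchCyc, a body of semantic knowledge about different domains. The methodology is based on prior research on semantic and linguistic knowledge repositories. It has been implemented in a prototype and applied to web queries. The preliminary results from the prototype are encouraging, but validation with different types of queries and domains is needed to provide more conclusive evidence. Further work is also needed to determine the circumstances under which the methodology does not produce good search results.

This research contributes to the improvement of web queries in several ways. First, it shows that semantic and linguistic knowledge together improve query expansion. Second, it identifies and formalizes web-query problems and presents a query classification scheme that explains why query expansion may fail in some cases, even if the repository supporting the task is complete. Third, it shows that an ontology structure should contain concepts and generalization relationships in order to represent both intensional and extensional information. Including heuristics and integrity constraints makes it easier to infer new information for creating intelligent queries.

Although ResearchCyc contains useful knowledge for supporting web queries, it is more effective when combined with linguistic knowledge from WordNet and factual knowledge from the web. Future work will concentrate on expanding the heuristics of the methodology and automating different aspects of it (for example, disambiguation). It will also examine the feasibility of integrating ResearchCyc with other ontologies and search engines.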