Improving web-query processing through semantic knowledge and user feedback－2


Tags:

#semantic #user #processing #query #classification #extension

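This article discusses query optimization techniques, including interactive query refinement and the use of semantic understanding in query processing. It describes how user feedback and semantic knowledge can improve query results, and presents a classification of queries that distinguishes intensional queries from extensional queries.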

Several research efforts have attempted to minimize these problems. Interactive approaches, such as query refinement, help the user identify better terms for a query [7] and [17]. These techniques show the user information related to Q_W. The user then uses this information to learn more about the domain and identify better terms or concepts for the query. These techniques are also useful when the user is not certain about what he or she is searching for, and needs a conceptualization of the intended domains. For example, these techniques help the user to redefine an initial query about pets (namely, “Pet”) to a less ambiguous query after the user realizes his or her interests are about buying domestic cats, “Buy Persian Cat Atlanta”. Other approaches use linguistic repositories for minimizing the problem of query ambiguity by using synonyms, hypernyms, and hyponyms. Nevertheless, there does not appear to be any technique that enables the user to identify when the results of the query are relevant.

2.2. Query classification

Initial experiments using the ResearchCyc knowledge base for query expansion and refinement reveal that some queries cannot be improved using ResearchCyc or any other semantic repository. Recall the query where the user is searching for Nike sport shoes in Georgia. Nike is not a concept, but an instance in ResearchCyc. The extensional information of this repository indicates that it is a commercial organization with the name “Nike,” and that it has an affiliate named Michael Jordan. However, ResearchCyc does not give any information about what this organization sells, and therefore the contained knowledge is useless for this query. Thus, one of the contributions of this research is to identify classes of queries that could or could not take advantage of the knowledge in ResearchCyc. The classification takes into account the types of knowledge that are needed to improve a web query.

– Intensional query: This type of query obtains intensional information about E_R. This kind of query is driven by the user's need to increase his or her knowledge. Only intensional terms should be used to define the query. For example, the query “Buy a Pet Cat” is intensional because the user is interested in learning something new about buying cats, such as which kinds of shops sell cats, and guidelines for buying a cat.

Usually knowledge repositories use only intensional information to define their knowledge. Therefore, any knowledge repository that deals with the intended domains of an intensional query can be used successfully to improve such a query.

– Extensional query: The purpose of this type of query is to obtain extensional information about E_R. The user is not interested in learning abstract knowledge, but specific knowledge. This kind of query tends to be motivated by more practical needs. In the query “Nike Georgia” the user may be interested in the shops in Georgia that sell Nike products, or the Nike factories located in the country of Georgia. This query is extensional because the user is not interested in general information about U.S. states or commercial companies. The user is interested in obtaining information about particular instances of those concepts (Nike is an instance of a commercial company and Georgia is an instance of a U.S. state).

The user is not interested in the possible relationship types between the concepts, but in the real relationships between them. For instance, Business has a relationship type (in ResearchCyc) called parentCompany that relates one business to another business which is its parent company. In an intensional query, the user may be interested in knowing that a business may sometimes have a parent company. However, in extensional queries, the user is only interested in this relationship when there is an instance of the relationship in which Nike is a participant.

To improve this kind of query, knowledge repositories are needed that include both the intension of the intended domains and their extension. The more extensional information the repository has, the better the results obtained through query expansion. Unfortunately, current knowledge repositories do not take extensional knowledge into account; they include instances only to exemplify the intensional knowledge. Therefore, current knowledge repositories are of little use for improving this kind of query.

An intensional or extensional query may also contain other kinds of information, which help to contextualize the query. In some cases, this additional knowledge allows discarding domains that do not fit the user’s intended domains. For example, “Nike Georgia buy” is an extensional query (“Nike Georgia”) that contains intensional information (“buy”) that delimits the purpose of the user: buying. On the other hand, the query “Flute Bohemian Drink” is an intensional query (“Flute Drink”) with extensional information (“Bohemian”) that contextualizes the query. We are only interested in information regarding Flute and Drink related in any way to a Bohemian instance, which may be an instance of a geographical region (Czech Republic), a music movement, and so on.
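The classification above can be illustrated with a minimal sketch. It assumes a toy knowledge base in which every known term is either a concept (intensional) or an instance (extensional). The term sets, the function name `classify_query`, and the majority rule for mixed queries are illustrative assumptions, not part of ResearchCyc or of the method described here.

```python
# Toy knowledge base: both sets are illustrative assumptions, not ResearchCyc content.
CONCEPTS = {"buy", "pet", "cat", "flute", "drink", "shop"}   # intensional terms
INSTANCES = {"nike", "georgia", "bohemian", "atlanta"}       # extensional terms


def classify_query(query):
    """Return (kind, context_terms) for a keyword query.

    kind is "intensional", "extensional", "mixed", or "unknown".
    context_terms are the minority-kind terms that contextualize the query.
    """
    terms = [t.lower() for t in query.split()]
    intensional = [t for t in terms if t in CONCEPTS]
    extensional = [t for t in terms if t in INSTANCES]

    if not intensional and not extensional:
        return ("unknown", [])
    # Assumption: the majority kind names the query; the rest is context.
    if len(extensional) > len(intensional):
        return ("extensional", intensional)
    if len(intensional) > len(extensional):
        return ("intensional", extensional)
    return ("mixed", [])


if __name__ == "__main__":
    for q in ["Buy a Pet Cat", "Nike Georgia", "Nike Georgia buy", "Flute Bohemian Drink"]:
        print(q, "->", classify_query(q))
```

Under these assumptions, “Nike Georgia buy” is labeled extensional with “buy” as intensional context, and “Flute Bohemian Drink” is labeled intensional with “bohemian” as extensional context, matching the two examples in the text. A real system would need word-sense disambiguation rather than a fixed lookup.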

For queries that use intensional and extensional knowledge, we need to use knowledge repositories that represent both of these types of knowledge for the intended domains. Table 2 summarizes the roles of the lexical, semantic, and factual information in the query process.

Table 2. Roles of the different kinds of information in the query process

| Kind of information | Necessary to find concepts related to the query | Necessary for intensional queries | Necessary for extensional queries | Other effects |
|---|---|---|---|---|
| Synonymy | √ | | | Increases the number of intended domains; prioritizes the intersection of the obtained domains |
| Taxonomic relationship | | √ | | Produces the expansion or pruning of the obtained domains |
| Non-taxonomic relationship | | √ | | Allows inferring candidate domains; allows focusing the query on the candidate domains |
| Factual | | | √ | Allows inferring candidate domains in intensional queries |

3. Semantics in query processing

The use of semantic, linguistic, and factual knowledge for web queries may require common sense knowledge. The following shows the kind of knowledge and structure an ontology should have to improve any kind of query (extensional or intensional). These types of knowledge could be combined to form an ontology that contains, for example, the semantic information of ResearchCyc, the linguistic information from WordNet, and the factual information from the Web. The ontology could then be used to facilitate the processing of a web query. The semantic information is necessary to represent the intended domains. The linguistic information is necessary to identify the concepts that represent the query terms and to define the linguistic relationships (synonymy, antonymy, etc.) of such concepts. The factual information is necessary to represent the particular objects of the real world, such as the cities of the United States or the pet stores of a given city. Since such an ontology should contain semantic, linguistic, and factual knowledge, it should be composed of the following sets [18]:

– Concepts: A concept is something we have created in our mind in order to generalize the properties a set of objects have in common. A concept has an extension and an intension. The extension is the set of all its possible instances; the intension is the set of common properties of all its instances. There are two kinds of concepts:

– Classes: A class represents a concept type; it can be specialized or generalized;

– Properties: A property relates objects, and describes their interactions or properties.

– Individuals: These form the extension of the concepts; each one represents a particular object of the real world (tangible or intangible).

– Classification relationships between an individual and a concept. The fact that i is an instance of concept c (either an entity type or a relationship type) is denoted as InstanceOf(i, c).

– Generalization relationships between concepts: These binary relationships organize the concepts in a tree structure using generalization/specialization. These relationships specify that a concept (child) is a subtype of another concept (parent), and therefore the extension of a child must be included in the extension of the parent. This relationship type is an inclusion integrity constraint between the parent and the child concepts. However, because such relationships play an important role in ontologies, they are defined explicitly as a relationship type.

The concepts and the generalization relationships are needed to conceptualize the intensional information about the domains to model. On the other hand, the individuals and the classification relationships are needed to represent the extensional information of our conceptualization. A concept of the ontology may represent either semantic or linguistic information. The rationale is that the ontology should represent the synonyms of each concept, whether a concept is a compound name composed of two or more nouns, and so on. For example, if there is a semantic concept in the ontology called DomesticPet, the ontology should also contain linguistic information which indicates that its name is a noun comprised of two words, Domestic and Pet, and is the denotation of the word Pet.
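The structure described above (concepts, individuals, InstanceOf, and generalization with extension inclusion) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the class name `Ontology`, its methods, and the sample concepts and individuals are hypothetical and do not reproduce ResearchCyc's actual representation.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    # concept -> parent concept (None for a root); a tree, as in the text
    parents: dict = field(default_factory=dict)
    # set of (individual, concept) pairs, i.e. InstanceOf(i, c)
    instance_of: set = field(default_factory=set)

    def add_concept(self, name, parent=None):
        self.parents[name] = parent

    def add_individual(self, individual, concept):
        self.instance_of.add((individual, concept))

    def ancestors(self, concept):
        """Return the concept followed by all of its generalizations."""
        chain = []
        while concept is not None:
            chain.append(concept)
            concept = self.parents.get(concept)
        return chain

    def extension(self, concept):
        """Individuals of the concept, including those of its subtypes.

        This enforces the inclusion constraint: the extension of a child
        is always contained in the extension of its parent.
        """
        return {i for (i, c) in self.instance_of if concept in self.ancestors(c)}


# Illustrative content only.
onto = Ontology()
onto.add_concept("Animal")
onto.add_concept("Pet", parent="Animal")
onto.add_concept("Cat", parent="Pet")
onto.add_individual("Garfield", "Cat")

if __name__ == "__main__":
    print(onto.extension("Cat"))
    print(onto.extension("Animal"))
```

Here the intensional side is the `parents` tree and the extensional side is the `instance_of` set. Asking for the extension of `Animal` returns `Garfield` even though the individual was only declared as a `Cat`.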

Although heuristics are not strictly necessary in an ontology, they enable inference. For example, ResearchCyc defines a heuristic for the concept DomesticPet which indicates that most pets have a pleasant personality. If the user is searching for a friendly animal, this information can be used to infer that he or she is interested in a DomesticPet, because there is a generalization relationship between Pet and Animal, a synonymy relationship between friendly and pleasant, and a heuristic which indicates that pets tend to be friendly.
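The inference chain in this example (synonymy, then heuristic, then generalization) can be sketched as follows. All three lookup tables and the function names are hypothetical stand-ins for knowledge that would come from ResearchCyc and a linguistic repository; the sketch only shows how the three kinds of knowledge combine.

```python
# Illustrative knowledge, assumed for this sketch.
PARENT = {"DomesticPet": "Pet", "Pet": "Animal", "WildAnimal": "Animal", "Animal": None}
SYNONYMS = {"friendly": {"pleasant"}, "pleasant": {"friendly"}}
# Heuristic: instances of DomesticPet typically have a pleasant personality.
HEURISTICS = {"DomesticPet": {"pleasant"}}


def is_subtype(child, ancestor):
    """True if child equals ancestor or is a specialization of it."""
    while child is not None:
        if child == ancestor:
            return True
        child = PARENT.get(child)
    return False


def infer_concepts(adjective, concept):
    """Concepts under `concept` whose heuristics match `adjective` or a synonym."""
    wanted = {adjective} | SYNONYMS.get(adjective, set())
    return sorted(
        c for c, traits in HEURISTICS.items()
        if traits & wanted and is_subtype(c, concept)
    )


if __name__ == "__main__":
    print(infer_concepts("friendly", "Animal"))
```

With these tables, a query for a friendly animal resolves to `DomesticPet`: "friendly" matches "pleasant" through synonymy, the heuristic links "pleasant" to `DomesticPet`, and the generalization chain confirms that `DomesticPet` is a kind of `Animal`.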

Similarly, constraints are not needed but may be useful for detecting the concepts (or contexts) in which the user is interested. For example, suppose the user is interested in obtaining information about the cat Garfield, but does not remember which kind of animal Garfield is. The user’s query, “Garfield Pet”, may return irrelevant web pages because Garfield is also a surname. The disjoint integrity constraint between the concepts Cat and Person (defined in the ResearchCyc ontology) may be used to determine that the user is not interested in persons, and therefore to modify the query so that web pages dealing with persons are discarded.
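A minimal sketch of this use of a disjointness constraint is shown below. The sense table, the taxonomy, the function name `refine`, and the `-term` exclusion syntax (a convention supported by many search engines) are assumptions made for illustration; the text does not specify how the modified query is expressed.

```python
# Illustrative knowledge, assumed for this sketch.
SENSES = {"garfield": {"Cat", "Person"}}          # possible concepts for an ambiguous term
PARENT = {"Cat": "Pet", "Pet": "Animal", "Animal": None, "Person": None}
DISJOINT = {frozenset({"Cat", "Person"})}         # disjoint integrity constraints


def is_subtype(child, ancestor):
    """True if child equals ancestor or is a specialization of it."""
    while child is not None:
        if child == ancestor:
            return True
        child = PARENT.get(child)
    return False


def refine(term, context_concept):
    """Keep the senses of `term` compatible with the context concept and
    exclude senses that are disjoint with a kept sense."""
    senses = SENSES.get(term.lower(), set())
    kept = {s for s in senses if is_subtype(s, context_concept)}
    dropped = {
        s for s in senses - kept
        if any(frozenset({s, k}) in DISJOINT for k in kept)
    }
    exclusions = " ".join(f"-{s.lower()}" for s in sorted(dropped))
    return f"{term} {context_concept} {exclusions}".strip()


if __name__ == "__main__":
    print(refine("Garfield", "Pet"))
```

For “Garfield Pet”, the Cat sense is kept because Cat is a kind of Pet, and the Person sense is excluded because it is disjoint with Cat. A term with no known senses is passed through unchanged.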

Although most ontology languages facilitate representing instances, most of the ontologies used to support web queries represent neither the instances of the conceptualization they are modeling nor the linguistic information about their concepts. As a result, they cannot correctly improve user requests that use extensional knowledge, and must use external linguistic repositories to identify which concepts in the ontology are related to the terms provided by the user.
