
LINGUISTIC REGULARITIES IN WORD REPRESENTATIONS

In 2013, Mikolov et al. (2013) published a paper showing that complicated semantic analogy problems could be solved simply by adding and subtracting vectors learned with a neural network. Since then, there has been some more investigation into what is actually behind this method, and also some suggested improvements. This post is a summary/discussion of the paper “Linguistic Regularities in Sparse and Explicit Word Representations“, by Omer Levy and Yoav Goldberg, published at ACL 2014.

The Task

The task under consideration is analogy recovery. These are questions in the form:

a is to b as c is to d

In a usual setting, the system is given words a, b, c, and it needs to find d. For example:

‘apple’ is to ‘apples’ as ‘car’ is to ?

where the correct answer is ‘cars’. Or the well-known example:

‘man’ is to ‘woman’ as ‘king’ is to ?

where the desired answer is ‘queen’.

While methods such as relation extraction would also be completely reasonable approaches to this problem, the research is mainly focused on solving it by using vector similarity methods. This means we create vector representations for each of the words, and then use their positions in the high-dimensional feature space to determine what the missing word should be.

Vector Spaces

As the first step, we need to create feature vectors for each word in our vocabulary. The two main ways of doing this, which are also considered by this paper, are:

  1. BOW: The bag-of-words approach. We count all the words that a certain word appears with, treat the counts as features in a vector, and weight them using something like positive pointwise mutual information (PPMI).
  2. word2vec: Vectors created using the word2vec toolkit. Low-dimensional dense vector representations are learned for each word using a neural network.

If you want to learn more details about these models, take a look at an earlier post about comparing both of these models.
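As a rough illustration of the first approach, here is a minimal sketch of building a PPMI-weighted co-occurrence matrix with numpy. The window size, the dense matrix, and the function name are my own simplifications for illustration; a realistic vocabulary would need sparse storage.

```python
import numpy as np

def ppmi_matrix(sentences, window=5):
    """Count word-context co-occurrences within a window and weight them with PPMI."""
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            for c in s[max(0, i - window): i + window + 1]:
                if c != w:
                    counts[idx[w], idx[c]] += 1
    total = counts.sum()
    p_word = counts.sum(axis=1, keepdims=True) / total      # P(word)
    p_context = counts.sum(axis=0, keepdims=True) / total   # P(context)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((counts / total) / (p_word @ p_context))
    return vocab, np.maximum(pmi, 0)  # positive PMI: negative values clipped to zero
```

For the second approach, the dense vectors could be trained with the original word2vec toolkit or, for instance, gensim's Word2Vec implementation.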

Analogy Arithmetic

We have the vectors, but how can we find the correct words in the analogy task?

The idea of Mikolov et al. (2013) was that the offset between word vectors should reflect their relation:

a − b ≈ c − d
man − woman ≈ king − queen

We can rearrange this to find d:

d ≈ c − a + b
queen ≈ king − man + woman

Therefore, we can calculate the vector c − a + b, compare it to the vectors of all words in our vocabulary, and return the highest-scoring word as the answer:

d_w = argmax_{d_w ∈ V} cos(d, c − a + b)

I’m using d_w here to denote the actual word, whereas d stands for the vector representation of d_w; V is our whole vocabulary, and cos() is the cosine similarity between two vectors. Let’s refer to this as the addition method.
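A minimal sketch of the addition method, assuming the word vectors are stored as unit-normalised numpy arrays in a dict (the function and variable names are my own, not from the paper):

```python
import numpy as np

def addition_method(a, b, c, vectors):
    """Find the word whose vector is closest (by cosine) to c - a + b."""
    target = vectors[c] - vectors[a] + vectors[b]
    target /= np.linalg.norm(target)
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):          # the query words are excluded as candidates
            continue
        score = vec @ target           # dot product = cosine for unit-length vectors
        if score > best_score:
            best_word, best_score = word, score
    return best_word
```

With a good vector space, addition_method('man', 'woman', 'king', vectors) would ideally return 'queen'.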

An alternative to this is to consider the direction of the offset directly. We find the word that has the most similar offset (in terms of direction), compared to the offset of the sample words.

d_w = argmax_{d_w ∈ V} cos(a − b, c − d)

It only looks at the direction of the relational step, ignoring the actual magnitude of this step. We’ll see later that sometimes this is good, sometimes not so much. We’ll refer to this method as direction. Apparently this method was actually used by Mikolov in some experiments, although it wasn’t included in the paper.
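The same search loop, adapted to compare offset directions instead; again just an illustrative sketch under the same assumptions as above.

```python
import numpy as np

def direction_method(a, b, c, vectors):
    """Find the word d whose offset from c points in the same direction as a - b."""
    offset = vectors[a] - vectors[b]
    offset /= np.linalg.norm(offset)
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        candidate = vectors[c] - vec
        norm = np.linalg.norm(candidate)
        if norm == 0:
            continue
        score = (candidate / norm) @ offset   # cosine between the two offsets
        if score > best_score:
            best_word, best_score = word, score
    return best_word
```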

Now onto the modification proposed by Levy & Goldberg (2014). The addition method basically looks for a word that is similar to c and b, and not similar to a, with each contributing equally. This can lead to cases where one large term dominates the equation, especially if the magnitudes of the similarities are very different. Therefore, the authors have proposed that instead of adding these similarities, they could be multiplied:

d_w = argmax_{d_w ∈ V} [ cos(d, c) · cos(d, b) / (cos(d, a) + ε) ]

If the vectors are normalised (unit length), then this formula actually becomes a multiplication of their scalar products:

d_w = argmax_{d_w ∈ V} [ (d · c) × (d · b) / ((d · a) + ε) ]

ε is set to 0.001 to avoid division by zero. Also, the formula doesn’t handle negative similarity scores, whereas cosine (and the scalar product) can be negative. Therefore, the values are transformed into a positive range by doing x_new = (x + 1)/2. We’ll refer to this as the multiplication method.
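Putting the pieces together, here is a sketch of the multiplication method with the (x + 1)/2 shift, again assuming unit-length vectors in a dict; names are illustrative rather than taken from the paper.

```python
import numpy as np

def multiplication_method(a, b, c, vectors, eps=0.001):
    """Find the word maximising cos(d,c) * cos(d,b) / (cos(d,a) + eps)."""
    va, vb, vc = vectors[a], vectors[b], vectors[c]
    best_word, best_score = None, -np.inf
    for word, d in vectors.items():
        if word in (a, b, c):
            continue
        # shift each cosine from [-1, 1] into [0, 1] before multiplying/dividing
        sim_a = (d @ va + 1) / 2
        sim_b = (d @ vb + 1) / 2
        sim_c = (d @ vc + 1) / 2
        score = sim_c * sim_b / (sim_a + eps)
        if score > best_score:
            best_word, best_score = word, score
    return best_word
```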

Evaluation

The alternative methods are evaluated on the task of analogy recovery, using three different datasets:

  1. The MSR dataset containing 8,000 analogy questions.
  2. The GOOGLE dataset with 19,544 analogy questions.
  3. The SEMEVAL dataset, covering 79 distinct relation types.

In the MSR and GOOGLE datasets, the system needs to find the correct answer from an open vocabulary. In contrast, the SEMEVAL task provides candidates for each relation that need to be reranked. Accuracy is used as the evaluation metric on all datasets.
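For the open-vocabulary datasets, the evaluation loop could look roughly like this (a sketch; `predict` stands for any of the methods above, wrapped so it only takes the three query words):

```python
def accuracy(questions, predict):
    """Fraction of (a, b, c, gold_d) analogy questions answered correctly."""
    correct = sum(predict(a, b, c) == gold for a, b, c, gold in questions)
    return correct / len(questions)

# e.g. accuracy(questions, lambda a, b, c: addition_method(a, b, c, vectors))
```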

The word vectors were trained on 1.5 billion words of English Wikipedia. The vocabulary was restricted to contain only words that occurred at least 100 times in the corpus, resulting in 189,533 words.

Results

[Results table from the paper: accuracy of the addition, direction, and multiplication methods with BOW and word2vec vectors on MSR, GOOGLE, and SEMEVAL]

Some observations based on the results:

  • The multiplication method outperforms the addition method on two tasks, and gives comparable results on the third one.
  • The direction method performs horribly on the MSR and GOOGLE tasks, but gives the best results on SEMEVAL. This is because the method only looks at the direction of the relation, but doesn’t care if the result is actually positioned very far in the semantic space. The SEMEVAL task restricts the candidates to a small list, and the method is able to pick the best option, but doesn’t perform as well on open vocabulary.
  • The BOW model is generally worse than word2vec when using the addition method, but is comparable to the best results when using multiplication. This indicates that the property of linguistic regularities applies to both classical BOW and neural network representations.

Discussion

Based on my experience with applying different vector operations to discover linguistic regularities, I’d like to point out two observations.

First, while the ability to recover semantic relations in such a way is remarkable, it is not quite as impressive as it seems at first glance. Let’s take the well-known example of king – man + woman = queen. Depending on the parameters I used, several vector spaces I built actually gave queen as the most similar word to king, with no vector operations needed. This makes sense, because king and queen are semantically very related, and probably occur in very similar contexts. But it also means that (woman – man) only needs to be a zero vector or some small random noise in order to give the correct answer. I’m not saying this is a major factor in these experiments, but I would encourage future work to also report baselines without using any vector operations.
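Such a baseline is trivial to check: simply return the nearest neighbour of c, with no vector arithmetic at all (a sketch under the same assumptions as the earlier code):

```python
import numpy as np

def nearest_neighbour_baseline(c, vectors):
    """Return the word most similar to c itself, with no vector arithmetic."""
    vc = vectors[c]
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word == c:
            continue
        score = vec @ vc    # cosine similarity for unit-length vectors
        if score > best_score:
            best_word, best_score = word, score
    return best_word
```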

Second, I observed that the quality of these regularities highly depends on the examples that are provided, and how they relate to the query word. For my work (Rei & Briscoe, 2014) I looked at hyponym relations, and below are some examples when trying to find hyponyms for bird.

[Table: top results when looking for hyponyms of ‘bird’, using plain vector similarity and different example pairs such as (salmon, fish) and (treatment, therapy)]

The greyed-out terms are the query terms, and it is common practice to ignore these in the results; the bold results are correct answers. As you can see, plain vector similarity alone returns one bird species correctly in the top results. When using (salmon, fish) as the example pair, the system does quite well and returns many more. But as I start to use hyponym examples from other domains, the quality degrades fast, and (bird − treatment + therapy) actually returns worse results than basic vector similarity.

I believe there is still plenty of space to further improve our understanding of how we can operate on various types of features, in order to model different semantic relations and compositional semantics.

References

Levy, O., & Goldberg, Y. (2014). Linguistic Regularities in Sparse and Explicit Word Representations. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (pp. 171–180).

Mikolov, T., Yih, W., & Zweig, G. (2013). Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL-HLT 2013 (pp. 746–751).

Rei, M., & Briscoe, T. (2014). Looking for Hyponyms in Vector Space. In Proceedings of CoNLL-2014 (pp. 68–77).

from: http://www.marekrei.com/blog/linguistic-regularities-word-representations/
