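Python Natural Language Processing Study Notes (70): 8.2 What's the Use of Syntax?

This post looks at what syntax contributes beyond n-gram statistics. It examines the limits of n-gram text generation, shows how the conceptual framework of grammar helps explain why certain word sequences are ill-formed, and then introduces constituent structure and the way substituting phrases reveals the units of a sentence.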


8.2 What's the Use of Syntax?

Beyond n-grams

We gave an example in Chapter 2 of how to use the frequency information in bigrams to generate text that seems perfectly acceptable for short sequences of words but rapidly degenerates into nonsense. Here's another pair of examples that we created by computing the bigrams over the text of a children's story, The Adventures of Buster Brown (http://www.gutenberg.org/files/22816/22816.txt):

(4)

a. He roared with me the pail slip down his back

b. The worst part and clumsy looking for whoever heard light
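As a rough illustration of where such sequences come from, here is a minimal sketch of bigram-based generation in plain Python. The toy corpus and the function names `build_bigram_model` and `generate` are assumptions made for this example; they are not the Buster Brown text or the code used to produce (4).

```python
import random
from collections import defaultdict

def build_bigram_model(tokens):
    """Map each word to the list of words that follow it in the text."""
    followers = defaultdict(list)
    for first, second in zip(tokens, tokens[1:]):
        followers[first].append(second)
    return followers

def generate(followers, start, length, seed=0):
    """Random walk over bigrams: each word depends only on the previous one."""
    rng = random.Random(seed)
    words = [start]
    # Stop early if the last word was never followed by anything.
    while len(words) < length and followers.get(words[-1]):
        words.append(rng.choice(followers[words[-1]]))
    return words

# Toy corpus (an assumption for illustration only).
toy_text = ("the worst part was the noise and the bear was slow and "
            "clumsy looking and the pail was full").split()
model = build_bigram_model(toy_text)
sample = generate(model, "the", 8)
print(" ".join(sample))
```

Every adjacent pair in the output is attested in the corpus, yet nothing in the model checks that the sequence as a whole forms a sentence, which is why longer outputs tend to drift into word salad.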

You intuitively know that these sequences are "word-salad", but you probably find it hard to pin down what's wrong with them. One benefit of studying grammar is that it provides a conceptual framework and vocabulary for spelling out these intuitions. Let's take a closer look at the sequence "the worst part and clumsy looking". This looks like a coordinate structure, where two phrases are joined by a coordinating conjunction such as "and", "but", or "or". Here's an informal (and simplified) statement of how coordination works syntactically:

Coordinate Structure:

If v1 and v2 are both phrases of grammatical category X, then "v1 and v2" is also a phrase of category X.

Here are a couple of examples. In the first, two NPs (noun phrases) have been conjoined to make an NP, while in the second, two APs (adjective phrases) have been conjoined to make an AP.

(5)

a. The book's ending was (NP the worst part and the best part) for me.

b. On land they are (AP slow and clumsy looking).
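The coordination rule can be sketched as a small check in plain Python. This is only an illustration: the function name `coordinate` and the representation of a phrase as a (category, text) pair are assumptions, and the category labels are assigned by hand rather than computed.

```python
def coordinate(phrase1, phrase2, conjunction="and"):
    """Apply the coordination rule to two (category, text) pairs.

    Returns a new (category, text) pair, or None when the categories differ.
    """
    cat1, text1 = phrase1
    cat2, text2 = phrase2
    if cat1 != cat2:
        return None  # the rule only licenses conjoining like with like
    return (cat1, f"{text1} {conjunction} {text2}")

np_result = coordinate(("NP", "the worst part"), ("NP", "the best part"))
ap_result = coordinate(("AP", "slow"), ("AP", "clumsy looking"))
bad_result = coordinate(("NP", "the worst part"), ("AP", "clumsy looking"))

print(np_result)   # ('NP', 'the worst part and the best part')
print(ap_result)   # ('AP', 'slow and clumsy looking')
print(bad_result)  # None
```

The third call mixes an NP with an AP, so the rule does not apply and no phrase is produced.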

What we can't do is conjoin an NP and an AP, which is why "the worst part and clumsy looking" is ungrammatical. Before we can formalize these ideas, we need to understand the concept of constituent structure.

Constituent structure is based on the observation that words combine with other words to form units. The evidence that a sequence of words forms such a unit is given by substitutability: that is, a sequence of words in a well-formed sentence can be replaced by a shorter sequence without rendering the sentence ill-formed. To clarify this idea, consider the following sentence:

(6) The little bear saw the fine fat trout in the brook.

The fact that we can substitute "He" for "The little bear" indicates that the latter sequence is a unit. By contrast, we cannot replace "little bear saw" in the same way.

(7)

a. He saw the fine fat trout in the brook.

b. *The he the fine fat trout in the brook.

In Figure 8.1, we systematically replace longer sequences with shorter ones in a way that preserves grammaticality. Each sequence that forms a unit can in fact be replaced by a single word, and we end up with just two elements.

Figure 8.1: Substitution of Word Sequences: working from the top row, we can replace particular sequences of words (e.g. the brook) with individual words (e.g. it); repeating this process we arrive at a grammatical two-word sentence.
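Since the figure itself is not reproduced here, the following sketch replays a substitution sequence of this kind in plain Python. Only two steps come from the text ("the brook" to "it" from the caption, and "The little bear" to "He" from (7a)); the remaining steps and the helper name `substitute` are assumptions chosen so that the process ends in a two-word sentence.

```python
def substitute(words, sequence, replacement):
    """Replace the first occurrence of `sequence` in `words` with `replacement`."""
    n = len(sequence)
    for i in range(len(words) - n + 1):
        if words[i:i + n] == sequence:
            return words[:i] + replacement + words[i + n:]
    raise ValueError(f"sequence not found: {' '.join(sequence)}")

sentence = "the little bear saw the fine fat trout in the brook".split()

steps = [
    ("the brook", "it"),            # from the figure caption
    ("the little bear", "he"),      # from example (7a)
    ("the fine fat trout", "it"),   # assumed step
    ("in it", "there"),             # assumed step
    ("saw it there", "ran"),        # assumed step
]

rows = [sentence]
for old, new in steps:
    rows.append(substitute(rows[-1], old.split(), new.split()))

for row in rows:
    print(" ".join(row))
```

The function only performs the replacement mechanically. Whether each resulting row is still well-formed is a judgment the reader supplies, and it is that judgment which counts as evidence that the replaced sequence is a unit.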

In Figure 8.2, we have added grammatical category labels to the words we saw in the earlier figure. The labels NP, VP, and PP stand for noun phrase, verb phrase and prepositional phrase respectively.


Figure 8.2: Substitution of Word Sequences Plus Grammatical Categories: This diagram reproduces Figure 8.1 along with grammatical categories corresponding to noun phrases (NP), verb phrases (VP), prepositional phrases (PP), and nominals (Nom).

If we now strip out the words apart from the topmost row, add an S node, and flip the figure over, we end up with a standard phrase structure tree, shown in (8). Each node in this tree (including the words) is called a constituent. The immediate constituents of S are NP and VP.

(8) [Image not recovered: phrase structure tree for sentence (6), with S as the root node]

As we will see in the next section, a grammar specifies how the sentence can be subdivided into its immediate constituents, and how these can be further subdivided until we reach the level of individual words.
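To make the idea of subdividing into immediate constituents concrete, here is a small sketch that stores a tree as nested tuples. The source only states that the immediate constituents of S are NP and VP; the internal shape of the tree below, and the helper names, are assumptions for illustration and may differ from the tree in (8).

```python
# A constituent is (label, child, child, ...); a word is a plain string.
tree = (
    "S",
    ("NP", ("Det", "the"), ("Nom", ("Adj", "little"), ("N", "bear"))),
    ("VP",
     ("V", "saw"),
     ("NP", ("Det", "the"),
      ("Nom", ("Adj", "fine"), ("Nom", ("Adj", "fat"), ("N", "trout")))),
     ("PP", ("P", "in"), ("NP", ("Det", "the"), ("Nom", ("N", "brook"))))),
)

def label(node):
    """Category label of a constituent, or the word itself for a leaf."""
    return node if isinstance(node, str) else node[0]

def immediate_constituents(node):
    """Children directly below a node; a word has none."""
    return [] if isinstance(node, str) else list(node[1:])

def leaves(node):
    """Words of the sentence, read off the tree from left to right."""
    if isinstance(node, str):
        return [node]
    return [word for child in node[1:] for word in leaves(child)]

top_level = [label(child) for child in immediate_constituents(tree)]
print(top_level)
print(" ".join(leaves(tree)))
```

Calling `immediate_constituents` again on each child repeats the subdivision one level down, until only individual words remain.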

Note

As we saw in Section 8.1, sentences can have arbitrary length. Consequently, phrase structure trees can have arbitrary depth. The cascaded chunk parsers we saw in Section 7.4 can only produce structures of bounded depth, so chunking methods aren't applicable here.
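The link between sentence length and tree depth can be sketched with the same nested-tuple representation. The embedding pattern "she said S" and the function names are hypothetical, chosen only to show that each extra layer of embedding makes the tree deeper.

```python
def depth(node):
    """Height of a tree: a bare word has depth 0."""
    if isinstance(node, str):
        return 0
    return 1 + max(depth(child) for child in node[1:])

def embed(sentence, times):
    """Wrap `sentence` in `times` layers of 'she said S' (hypothetical example)."""
    for _ in range(times):
        sentence = ("S", ("NP", "she"), ("VP", ("V", "said"), sentence))
    return sentence

base = ("S", ("NP", "he"), ("VP", ("V", "ran")))
depths = [depth(embed(base, k)) for k in range(4)]
print(depths)  # [3, 5, 7, 9]
```

Each embedding adds two levels, so no fixed depth bound covers every sentence that can be built this way.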
