Stanford NER - Unable to identify Phone number

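This post covers Stanford NER failing to recognize phone numbers, even when tested on the data it was trained on. The answer suggests using RegexNER instead and gives an example sentence, a rules file, and a command that tags the phone number. It also notes that the tokenizer turns a number containing a space into a single token, that writing regexes for tokens containing spaces is problematic, and how to work around that.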

I am training my NER model on an entity type Phonenumber, whose part of speech is number. However, when I test on the same data I trained on, the classifier does not identify the phone number.

Is that because the part of speech (POS) of a phone number is number (CD)?

Source: https://stackoverflow.com/questions/42416550

Updated: 2022-08-19 16:08

Best answer

You might want to use RegexNER instead for this use case.

Consider this sentence (put it in phone-number-example.txt):

You can reach the office at 555 555-5555.

If you make a RegexNER rules file like this (note that the columns are tab-separated):

[0-9]{3}\W[0-9]{3}-[0-9]{4}	PHONE_NUMBER	MISC,NUMBER	1
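
Since the columns have to be separated by real tab characters, and copy-paste or an editor can silently turn tabs into spaces, it may be safer to generate the file from a script. The following is a minimal Python sketch, not part of the original answer. The helper name format_rules is hypothetical, and the column meanings in the comment (token regex, NER label, overwritable labels, priority) follow the usual RegexNER mapping layout.

```python
from pathlib import Path

# The rule from the answer, one list entry per column:
# token regex, NER label, labels it may overwrite, priority.
RULE_COLUMNS = [r"[0-9]{3}\W[0-9]{3}-[0-9]{4}", "PHONE_NUMBER", "MISC,NUMBER", "1"]


def format_rules(rules):
    """Return the rules file text: one rule per line, columns joined by one tab."""
    return "".join("\t".join(columns) + "\n" for columns in rules)


if __name__ == "__main__":
    # File name taken from the -regexner.mapping argument in the command.
    Path("phone_number.rules").write_text(format_rules([RULE_COLUMNS]), encoding="utf-8")
```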

And run this command:

java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,regexner -regexner.mapping phone_number.rules -file phone-number-example.txt -outputFormat text

It will identify the phone number in the output NER tagging.

One issue to look out for: the tokenizer turns "555 555-5555" into a single token. The first column of the rules file is a regex that matches a token. More precisely, a RegexNER pattern is a space-separated list of patterns, one for each token you want to NER tag.
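
To make the token-by-token matching concrete, here is a toy Python model of it. It is only a sketch of the behaviour described above, not CoreNLP's implementation; the function rule_matches and the hand-written token list are assumptions for illustration.

```python
import re


def rule_matches(rule_regex, tokens):
    """Return the start indexes where the rule matches consecutive tokens.

    Toy model: the rule is split on spaces into one sub-pattern per token,
    and each sub-pattern has to match its token in full.
    """
    parts = [re.compile(part) for part in rule_regex.split(" ")]
    hits = []
    for start in range(len(tokens) - len(parts) + 1):
        window = tokens[start:start + len(parts)]
        if all(p.fullmatch(tok) for p, tok in zip(parts, window)):
            hits.append(start)
    return hits


# Tokens as the answer describes them: the phone number is ONE token.
tokens = ["You", "can", "reach", "the", "office", "at", "555 555-5555", "."]

one_token_rule = r"[0-9]{3}\W[0-9]{3}-[0-9]{4}"
# A literal space in the rule is read as a token boundary, so this rule
# expects two separate tokens and never sees the single phone-number token.
two_token_rule = r"[0-9]{3} [0-9]{3}-[0-9]{4}"

print(rule_matches(one_token_rule, tokens))  # [6]
print(rule_matches(two_token_rule, tokens))  # []
```

Under this model, a rule aimed at a token that itself contains a space cannot use a literal space, which is why the rule in the answer uses a character class instead.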

So in this example, the rule I made uses "\W" to capture the space. The rule wasn't working when I used "\s" and similar, so I think there is an issue with writing regexes for tokens that contain spaces. Typically, tokens don't contain spaces anyway.
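
One possible explanation, which the answer does not confirm: Stanford's PTBTokenizer has a normalizeSpace option that turns spaces inside tokens such as phone numbers into U+00A0 (non-breaking space), and Java's default "\s" covers only ASCII whitespace, whereas "\W" matches any non-word character. The Python sketch below mimics those ASCII-only classes with re.ASCII. The token value is an assumption about what the tokenizer emits, so it is worth checking against your own pipeline output.

```python
import re

# Assumption: the tokenizer replaced the in-token space with U+00A0.
token = "555\u00a0555-5555"

# re.ASCII restricts \s and \W to ASCII semantics, similar to Java's defaults.
with_s = re.compile(r"[0-9]{3}\s[0-9]{3}-[0-9]{4}", re.ASCII)
with_W = re.compile(r"[0-9]{3}\W[0-9]{3}-[0-9]{4}", re.ASCII)

print(bool(with_s.fullmatch(token)))  # False
print(bool(with_W.fullmatch(token)))  # True
```

If that is indeed the cause, matching U+00A0 explicitly would be a narrower alternative to "\W".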

You might want to work around this by replacing "\W" with something narrower that excludes the characters you don't want, since "\W" just means any non-word character. You can also make the pattern above more elaborate so that it captures the various phone number formats.
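
As a rough illustration of narrowing "\W", the Python sketch below compares the original pattern with one that allows only a few separators. The separator set is an assumption chosen for the example. It is written with \x20 and \xa0 escapes rather than a literal space, because a literal space in the first column would be read as a token boundary, and U+00A0 is included in case the tokenizer normalizes the in-token space to a non-breaking space.

```python
import re

# Allowed separators: space (\x20), non-breaking space (\xa0), dot, hyphen.
SEPARATOR = r"[\x20\xa0.-]"

loose = re.compile(r"[0-9]{3}\W[0-9]{3}-[0-9]{4}", re.ASCII)
strict = re.compile(r"[0-9]{3}" + SEPARATOR + r"[0-9]{3}-[0-9]{4}", re.ASCII)

candidates = ["555 555-5555", "555\u00a0555-5555", "555.555-5555", "555#555-5555"]
for token in candidates:
    # Columns: token, matched by \W version, matched by narrowed version.
    print(repr(token), bool(loose.fullmatch(token)), bool(strict.fullmatch(token)))
```

The "\W" version accepts all four candidates, including "555#555-5555", while the narrowed version rejects that last one.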

More info on RegexNER can be found here:

The Stanford Natural Language Processing Group