Week 7-2: POS tagging

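This post introduces the basics of part-of-speech (POS) tagging, including the definitions of open-class and closed-class parts of speech and the use of the Penn Treebank tag set. It discusses applications of POS tagging in tasks such as parsing, machine translation, and word sense disambiguation, and outlines the main technical approaches: rule-based, machine-learning-based, and transformation-based methods. It also covers the challenges POS tagging faces, such as lexical ambiguity and context dependence.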

POS

  • Open class
    • nouns, non-modal verbs, adjectives, adverbs
  • Closed class
    • prepositions, modal verbs, conjunctions, particles, determiners, pronouns

Penn Treebank tag set: the tag IN covers all prepositions except "to", which has its own tag TO. TO is used for every occurrence of "to", including when it acts as a particle (infinitive marker) rather than a preposition.

Some observations

  • ambiguity: the same word can take different tags depending on context, and even its pronunciation can differ between readings

POS tagging is useful for parsing, machine translation, word sense disambiguation, etc.

Main techniques

  • rule-based
  • machine learning (CRFs, maximum entropy, Markov models)
  • transformation-based

Source of information

  • Knowledge about individual words (unigram)
    • lexical information
    • spelling (e.g., the suffixes -or, -er)
    • capitalization (e.g., IBM)
  • Knowledge about neighboring words
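The information sources listed above can be turned into features for a tagger. The following is a minimal sketch, not taken from the lecture: the function name `word_features`, the feature names, and the boundary markers `<s>` and `</s>` are all assumptions chosen for illustration, and tokens are assumed to be non-empty strings.

```python
def word_features(tokens, i):
    """Return a feature dict for tokens[i] (hypothetical feature set)."""
    word = tokens[i]
    return {
        # Knowledge about the individual word
        "lower": word.lower(),                  # lexical information
        "suffix2": word[-2:].lower(),           # spelling, e.g. -or, -er
        "is_capitalized": word[0].isupper(),    # capitalization
        "is_all_caps": word.isupper(),          # e.g. IBM
        # Knowledge about neighboring words
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }


tokens = ["IBM", "hired", "a", "new", "director"]
first = word_features(tokens, 0)   # features for "IBM"
last = word_features(tokens, 4)    # features for "director"
```

A dictionary like this could be fed to any of the machine learning approaches named earlier, for example a maximum entropy classifier or a CRF.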

Evaluation

  • Baseline (relatively high)
    • tag each word with its most likely tag
    • tag each OOV word as a noun
    • accuracy around 90%
  • Current accuracy
    • around 97% for English
    • around 98% for human performance
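The baseline described above (most likely tag per word, noun for OOV words) can be sketched in a few lines. This is an illustrative sketch under assumptions: the function names, the toy training data, the choice to lowercase words, and the use of the Penn Treebank tag NN for "noun" are not from the lecture.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_baseline(model, tokens, oov_tag="NN"):
    """Tag known words with their most likely tag and OOV words as nouns."""
    return [(tok, model.get(tok.lower(), oov_tag)) for tok in tokens]


def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = sum(p == g for (_, p), (_, g) in zip(predicted, gold))
    return correct / len(gold)


# Toy training data (hypothetical); "run" is seen once as NN and twice as VBP.
train = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("run", "NN")],
    [("dogs", "NNS"), ("run", "VBP")],
    [("they", "PRP"), ("run", "VBP")],
]
model = train_baseline(train)
predicted = tag_baseline(model, ["the", "cat", "run"])
```

Here "cat" never appears in the training data, so it falls back to the noun tag, while "run" receives its most frequent training tag regardless of context. That context blindness is what separates this baseline from the stronger taggers.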

Rule-based tagging

  • use dictionary or finite-state transducers to find all possible POS
  • use disambiguation rules
    • e.g., ART + V (an article followed directly by a verb is never allowed)
  • hundreds of rules can be designed
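The two steps above (look up all possible tags, then prune with disambiguation rules) can be sketched as follows. This is a simplified illustration, not the lecture's implementation: the toy `LEXICON`, the tag N, the noun fallback for unknown words, and both function names are assumptions. Only the ART + V rule comes from the notes.

```python
# Toy dictionary of possible tags per word (hypothetical entries).
LEXICON = {
    "the": {"ART"},
    "can": {"N", "V"},
    "rusts": {"V"},
}


def candidate_tags(tokens, lexicon):
    """Step 1: look up every possible tag for each token (unknown words get N)."""
    return [set(lexicon.get(tok.lower(), {"N"})) for tok in tokens]


def apply_art_v_rule(candidates):
    """Step 2: forbid ART + V by dropping V after an unambiguous article.

    The V reading is only removed when another candidate remains, so no
    token is ever left without a tag.
    """
    result = [set(c) for c in candidates]
    for i in range(1, len(result)):
        if result[i - 1] == {"ART"} and len(result[i]) > 1:
            result[i].discard("V")
    return result


tokens = ["The", "can", "rusts"]
candidates = candidate_tags(tokens, LEXICON)
disambiguated = apply_art_v_rule(candidates)
```

In this example "can" starts out ambiguous between N and V, and the rule leaves only N because it follows an article. A real rule-based tagger would chain hundreds of such rules.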

Rule examples

[Figure: rule examples (image not available)]

Useful for unseen languages
