Chinese Opinion Extraction with the Stanford Parser (with Code)

This article looks at opinion extraction as applied to understanding user reviews of products: through text processing and syntactic parsing, it identifies feature words and the opinion words associated with them, improving the accuracy and efficiency of review analysis.


Problem:

Opinion extraction means pulling out, from a piece of text, the opinion words attached to a given feature word. A feature word usually serves as the subject or object of the sentence and is typically a noun (occasionally an adjective), while an opinion word is usually an adjective or adverb carrying sentiment. Extracting opinion words is very useful when analyzing user reviews of a product.

For example, in the sentence "卖家 的 服务 态度 不错 , 快递 也 很 迅速" ("the seller's service attitude is good, and the shipping is also very fast"), "服务" (service) and "快递" (shipping) are two feature words describing the seller, while "不错" (good) and "迅速" (fast) are their respective opinion words.
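The target output for that example can be pictured as a mapping from each feature word to its opinion word. A minimal sketch (the class name and use of a `Map` are illustrative choices, not part of the original code):

```java
import java.util.Map;

public class OpinionPairExample {
    public static void main(String[] args) {
        // Desired result for the example sentence: each feature word
        // is paired with the opinion word that describes it.
        Map<String, String> pairs = Map.of(
                "服务", "不错",
                "快递", "迅速");
        pairs.forEach((feature, opinion) ->
                System.out.println(feature + " -> " + opinion));
    }
}
```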

 

Method:

1. Choose the text data (a data source such as product review text).

2. Split the text into sentences and segment each sentence into words.

3. Filter the relevant sentences (find sentences that mention the feature word, by direct matching).

4. Parse the sentences syntactically (Stanford Parser).

5. Extract the opinion words (walk the constituency tree produced by the Stanford Parser and find the opinion-word node closest to the feature-word node; see the code below).
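Steps 2 and 3 are not covered by the code below, which assumes pre-segmented input. A minimal sketch of sentence splitting and direct-match filtering (the class and method names, and the delimiter set, are my assumptions; real word segmentation would need a tool such as the Stanford Segmenter):

```java
import java.util.ArrayList;
import java.util.List;

public class SentenceFilter {
    // Step 2 (partial): split a review into clauses on common
    // Chinese sentence/clause delimiters.
    static List<String> splitSentences(String text) {
        List<String> out = new ArrayList<>();
        for (String s : text.split("[。!?,;,]")) {
            if (!s.trim().isEmpty()) out.add(s.trim());
        }
        return out;
    }

    // Step 3: keep only sentences that mention the feature word (direct match).
    static List<String> filterByFeature(List<String> sentences, String feature) {
        List<String> out = new ArrayList<>();
        for (String s : sentences) {
            if (s.contains(feature)) out.add(s);
        }
        return out;
    }

    public static void main(String[] args) {
        String review = "卖家 的 服务 态度 不错 , 快递 也 很 迅速 。 包装 一般 。";
        List<String> sents = splitSentences(review);
        List<String> hits = filterByFeature(sents, "快递");
        System.out.println(hits);
    }
}
```

Only the sentences that survive this filter need to be handed to the (relatively expensive) parser.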

 

Code:

The code below skips the earlier steps. Input: a word-segmented sentence and a feature word. Output: the opinion word for that feature word.

 

 

package textAnalysis;

import java.io.StringReader;
import java.util.Iterator;
import java.util.List;

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.process.Tokenizer;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreebankLanguagePack;
import edu.stanford.nlp.trees.international.pennchinese.ChineseTreebankLanguagePack;

public class DepedWordExtra {

    static String[] options = { "-MAX_ITEMS", "200000000" };
    static LexicalizedParser lp = new LexicalizedParser(
            "grammar/chinesePCFG.ser.gz", options);

    public static void main(String[] args) {
        String sentence = "老师 穿 着 一件 很 美丽 的 衣服";
        String keyword = "衣服";
        extraDepWord(sentence, keyword);
    }

    // Parse the segmented sentence and print the opinion word for the keyword.
    private static void extraDepWord(String sentence, String keyword) {
        TreebankLanguagePack tlp = new ChineseTreebankLanguagePack();
        Tokenizer<? extends HasWord> toke = tlp.getTokenizerFactory()
                .getTokenizer(new StringReader(sentence));
        List<? extends HasWord> sentList = toke.tokenize();
        Tree parse = lp.apply(sentList);
        List<Tree> leaves = parse.getLeaves();

        Iterator<Tree> it = leaves.iterator();
        while (it.hasNext()) {
            Tree leaf = it.next();
            if (leaf.nodeString().trim().equals(keyword)) {
                Tree start = leaf.parent(parse); // POS-tag node of the keyword
                String tag = start.value().trim();
                boolean extraedflg = false;
                // If the keyword is tagged NN (or VA), climb toward the root,
                // scanning the sibling subtrees at each level for an opinion word.
                if (tag.equals("NN") || tag.equals("VA")) {
                    for (int i = 0; i < parse.depth(); i++) {
                        start = start.parent(parse);
                        if (start.value().trim().equals("ROOT") || extraedflg) {
                            break;
                        }
                        List<Tree> bros = start.siblings(parse);
                        if (bros != null) {
                            Iterator<Tree> it1 = bros.iterator();
                            while (it1.hasNext()) {
                                Tree bro = it1.next();
                                extraedflg = IteratorTree(bro, tag);
                                if (extraedflg) {
                                    break;
                                }
                            }
                        }
                    }
                }
            }
        }
    }

    // Depth-first search of a sibling subtree: an NN feature word pairs with
    // a VA (predicative adjective), a VA feature word with an AD (adverb).
    private static boolean IteratorTree(Tree bro, String tagKey) {
        List<Tree> ends = bro.getChildrenAsList();
        Iterator<Tree> it = ends.iterator();
        while (it.hasNext()) {
            Tree end = it.next();
            String tagDep = end.value().trim();
            if ((tagKey.equals("NN") && tagDep.equals("VA"))
                    || (tagKey.equals("VA") && tagDep.equals("AD"))) {
                Tree depTree = end.getChild(0); // the word under the POS tag
                System.out.println(depTree.value());
                return true;
            } else if (IteratorTree(end, tagKey)) {
                return true;
            }
        }
        return false;
    }
}
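The traversal above depends on the Stanford classes and a grammar file, but the search strategy itself can be illustrated on a hand-built toy tree: from the leaf holding the feature word, climb toward the root and, at each level, scan the sibling subtrees for the first node whose tag pairs with the feature's tag (NN → VA, VA → AD). The `Node` class and the example tree below are my own illustrative constructions, not part of the Stanford API:

```java
import java.util.ArrayList;
import java.util.List;

class Node {
    String label;
    Node parent;
    List<Node> children = new ArrayList<>();

    Node(String label) { this.label = label; }

    Node add(Node child) { child.parent = this; children.add(child); return child; }

    List<Node> siblings() {
        List<Node> sibs = new ArrayList<>();
        if (parent == null) return sibs;
        for (Node c : parent.children) if (c != this) sibs.add(c);
        return sibs;
    }
}

public class ToyOpinionSearch {
    // Depth-first scan of a sibling subtree for the paired opinion tag.
    static String findOpinion(Node subtree, String keyTag) {
        for (Node child : subtree.children) {
            boolean match = (keyTag.equals("NN") && child.label.equals("VA"))
                         || (keyTag.equals("VA") && child.label.equals("AD"));
            if (match) return child.children.get(0).label; // word under the tag
            String deeper = findOpinion(child, keyTag);
            if (deeper != null) return deeper;
        }
        return null;
    }

    public static void main(String[] args) {
        // Hand-built tree for "服务 不错": (ROOT (IP (NP (NN 服务)) (VP (VA 不错))))
        Node root = new Node("ROOT");
        Node ip = root.add(new Node("IP"));
        Node np = ip.add(new Node("NP"));
        Node nn = np.add(new Node("NN"));
        nn.add(new Node("服务"));
        Node vp = ip.add(new Node("VP"));
        Node va = vp.add(new Node("VA"));
        va.add(new Node("不错"));

        // Climb from the NN node, scanning siblings at each level.
        String opinion = null;
        Node cur = nn;
        while (cur != null && opinion == null && !cur.label.equals("ROOT")) {
            for (Node sib : cur.siblings()) {
                opinion = findOpinion(sib, "NN");
                if (opinion != null) break;
            }
            cur = cur.parent;
        }
        System.out.println(opinion); // prints 不错
    }
}
```

Because the climb stops at the first level where a paired tag is found, the extracted opinion word is the one structurally closest to the feature word, which is the heuristic the article relies on.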

About the Stanford Parser:

A natural language parser is a program that works out the grammatical structure of sentences, for instance, which groups of words go together (as "phrases") and which words are the subject or object of a verb. Probabilistic parsers use knowledge of language gained from hand-parsed sentences to try to produce the most likely analysis of new sentences. These statistical parsers still make some mistakes, but commonly work rather well. Their development was one of the biggest breakthroughs in natural language processing in the 1990s. You can try out the parser online.

This package is a Java implementation of probabilistic natural language parsers, both highly optimized PCFG and lexicalized dependency parsers, and a lexicalized PCFG parser. The original version of this parser was mainly written by Dan Klein, with support code and linguistic grammar development by Christopher Manning. Extensive additional work (internationalization and language-specific modeling, flexible input/output, grammar compaction, lattice parsing, k-best parsing, typed dependencies output, user support, etc.) has been done by Roger Levy, Christopher Manning, Teg Grenager, Galen Andrew, Marie-Catherine de Marneffe, Bill MacCartney, Anna Rafferty, Spence Green, Huihsin Tseng, Pi-Chuan Chang, Wolfgang Maier, and Jenny Finkel.

The lexicalized probabilistic parser implements a factored product model, with separate PCFG phrase structure and lexical dependency experts, whose preferences are combined by efficient exact inference, using an A* algorithm. Or the software can be used simply as an accurate unlexicalized stochastic context-free grammar parser. Either of these yields a good-performance statistical parsing system. A GUI is provided for viewing the phrase structure tree output of the parser.

As well as providing an English parser, the parser can be and has been adapted to work with other languages. A Chinese parser based on the Chinese Treebank, a German parser based on the Negra corpus, and Arabic parsers based on the Penn Arabic Treebank are also included. The parser has also been used for other languages, such as Italian, Bulgarian, and Portuguese. The parser provides Stanford Dependencies output as well as phrase structure trees. Typed dependencies are otherwise known as grammatical relations. This style of output is available only for English and Chinese. For more details, please refer to the Stanford Dependencies webpage.

The current version of the parser requires Java 6 (JDK 1.6) or later. (You can also download an old version of the parser, version 1.4, which runs under JDK 1.4, or version 2.0, which runs under JDK 1.5, but those distributions are no longer supported.) The parser also requires a reasonable amount of memory (at least 100 MB to run as a PCFG parser on sentences up to 40 words in length; typically around 500 MB of memory to be able to parse similarly long typical-of-newswire sentences using the factored model).

The parser is available for download, licensed under the GNU General Public License (v2 or later). Source is included. The package includes components for command-line invocation, a Java parsing GUI, and a Java API. The parser code is dual-licensed (in a similar manner to MySQL, etc.). Open-source licensing is under the full GPL, which allows many free uses. For distributors of proprietary software, commercial licensing with a ready-to-sign agreement is available. The download is a 54 MB zipped file (mainly consisting of included grammar data files). If you unpack the zip file, you should have everything needed. Simple scripts are included to invoke the parser on a Unix or Windows system. For another system, you merely need to similarly configure the classpath.