Using the Chinese Corpus Training Parameters

Reposted from: http://puremonkey2010.blogspot.tw/2012/08/stanford-parser-chinese-corpus-training.html

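Preface:
Stanford Parser is a parser that can perform both phrase-structure and dependency analysis. There is plenty of material about it online, the Stanford NLP website has extensive documentation, and the readme file shipped with the code is also quite detailed. Here I briefly record part of my learning process: the problems I ran into when training on a Traditional Chinese corpus, the parameters I ended up using, and a test run.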

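Chinese corpus training:
The Stanford Parser source code can be used as downloaded, without any modification. The default training setup targets the English WSJ corpus, so training on Chinese requires specifying the following parameters:
- Training: the command for training on Chinese is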

> java -mx4000m -cp stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser
-PCFG # train a Probabilistic Context-Free Grammar (PCFG) parser
-vMarkov 1 # vertical Markov order 1, i.e. no parent annotation
-uwm 0 # use no language-specific heuristics for unknown word processing
-tLPP edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams # specify the TreebankLangParserParams, needed when using a different language or treebank
-saveToSerializedFile train3.ser.gz # write the serialized model to train3.ser.gz
-maxLength 100 # the longest sentence that will be parsed
-escaper edu.stanford.nlp.trees.international.pennchinese.ChineseEscaper # class that does customized escaping of tokenized text
-train train_test.txt # training corpus file(s)
-segmentMarkov # builds in a segmenter; this option can be omitted
-encoding UTF-8 # use UTF-8 encoding

The edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams setting is mandatory; without it, training on Chinese does not work. I overlooked this when I first started and kept getting:

Extracting PCFG...Exception in thread "main" java.lang.RuntimeException:
> TreeAnnotator: null head found for tree [suggesting incomplete/wrong

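When training, you can choose between PCFG and Factored models, and there are many other parameters to choose from; see the readme file for details. The training command above produces a .gz file, which can then be tested.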
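The annotated training listing above cannot be pasted into a shell as-is, because the inline comments and line breaks would be interpreted by the shell. As a rough sketch, the same arguments can be assembled in Java. The class name `TrainCommand` and the method `build` are hypothetical helpers for illustration, not part of Stanford Parser; the flags and file names are the ones from the listing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of Stanford Parser) that assembles the training command. */
class TrainCommand {

    /** Returns the full argument list for the given corpus file and output model file. */
    static List<String> build(String trainFile, String modelFile) {
        return new ArrayList<>(Arrays.asList(
            "java", "-mx4000m", "-cp", "stanford-parser.jar",
            "edu.stanford.nlp.parser.lexparser.LexicalizedParser",
            "-PCFG",
            "-vMarkov", "1",
            "-uwm", "0",
            // Required for Chinese treebanks, as noted in the text.
            "-tLPP", "edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams",
            "-saveToSerializedFile", modelFile,
            "-maxLength", "100",
            "-escaper", "edu.stanford.nlp.trees.international.pennchinese.ChineseEscaper",
            "-train", trainFile,
            "-segmentMarkov",
            "-encoding", "UTF-8"));
    }

    public static void main(String[] args) {
        List<String> cmd = build("train_test.txt", "train3.ser.gz");
        // Print a single-line command that can be pasted into a shell.
        System.out.println(String.join(" ", cmd));
        // To launch training directly instead (needs stanford-parser.jar in the working directory):
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

Keeping the arguments in a list makes it easy to swap the corpus or model file name without retyping the long class names.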

- Testing:
You can test with the following command line:

> java -server -mx1800m -cp stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser -maxLength 200 -loadFromSerializedFile chinesePCFG.ser.gz -test ./corpus/ctb5/test.pid > ./test.result

Or write your own code to load the trained model and parse a sentence:

 
   
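 package stanford.test;

 import java.util.List;
 import edu.stanford.nlp.ling.CoreLabel;
 import edu.stanford.nlp.ling.Sentence;
 import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
 import edu.stanford.nlp.trees.Tree;

 public class Test {

     /**
      * Loads the model trained above and parses one pre-segmented sentence.
      *
      * @param args unused
      */
     public static void main(String[] args) {
         // The sentence is pre-segmented: tokens are separated by spaces.
         String sentence = "我 到 她 家 等候";
         String[] sents = sentence.split(" ");
         // Load the serialized model written by -saveToSerializedFile.
         LexicalizedParser lp = LexicalizedParser.loadModel("train3.ser.gz");
         // Sentence is the utility class in the release used here; later releases may name it differently.
         List<CoreLabel> rawWords = Sentence.toCoreLabelList(sents);
         Tree parse = lp.apply(rawWords);
         System.out.printf("\t[Info] Parsing result:\n%s\n", parse.toString());
     }
 }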

Output:

Loading parser from serialized file train3.ser.gz ... done [0.1 sec].
[Info] Parsing result:
(ROOT (S (NP (Nh 我)) (PP (P 到) (NP (Nh 她) (Nc 家)) (VK 等候))))
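The bracketed result can also be post-processed without the parser library. Below is a minimal sketch that pulls the word/tag pairs out of such a string with a regular expression. It assumes every leaf has the form `(TAG word)` with no spaces or parentheses inside the word; `TaggedWords` and `extract` are hypothetical names used only for this illustration. When the `Tree` object is still at hand, the library's own accessors (for example `taggedYield()`) are probably the better route.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that extracts word/tag pairs from a bracketed parse string. */
class TaggedWords {

    // A leaf looks like "(TAG word)": neither part contains whitespace or parentheses.
    private static final Pattern LEAF =
        Pattern.compile("\\(([^\\s()]+)\\s+([^\\s()]+)\\)");

    /** Returns the leaves in sentence order, formatted as "word/TAG". */
    static List<String> extract(String bracketedTree) {
        List<String> pairs = new ArrayList<>();
        Matcher m = LEAF.matcher(bracketedTree);
        while (m.find()) {
            pairs.add(m.group(2) + "/" + m.group(1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        String tree = "(ROOT (S (NP (Nh 我)) (PP (P 到) (NP (Nh 她) (Nc 家)) (VK 等候))))";
        System.out.println(extract(tree));
    }
}
```

For the tree above, the extracted list holds five entries, one per word, each paired with the tag assigned by the trained model (Nh, P, Nh, Nc, VK).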