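I was bored at home over the weekend, so I took the Nutch work I had been doing at the office and tried it out at home. At first I kept wondering whether adding an entire project to Nutch in the form of a plugin was a bit... oh well, just do it. To my surprise it worked on Sunday, so let me share what I learned.
First, the log output. The requirement is that while Nutch crawls a site, the added project (the plugin) parses the pages and extracts the data at the same time. It's that simple. I already introduced plugins in my last few posts, so here is the log: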
product_name = The Incident (CD)
product_price = $14.01
product_image = http://i43.tower.com/images/mm113708247/incident-porcupine-tree-cd-cover-art.jpg
product_category = Music Rock & Pop Progressive Rock
product_description = ? ? ?? ??? ???Learn more about the format using Tower WIKI. September 15, 2009 1 016861785727 113708247 #748 in Music (See ) #347 in Rock & Pop (See ) #2 in Progressive Rock (See )
product_review = To sample an individual track, click the button located beside your desired song.
product_type = dvd
product_url = http://www.tower.com/incident-porcupine-tree-cd/wapi/113708247
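If you want to check fields like these programmatically, the `key = value` lines above are easy to read back into a map. Here is a minimal sketch, assuming each field sits on one line and the first `=` separates the key from the value. The class `ProductLogParser` and its `parse` method are made up for illustration; they are not part of Nutch or of my plugin.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Nutch or the plugin.
public class ProductLogParser {

    // Turns "key = value" lines into a map that keeps the log order.
    // Lines with no '=' (or nothing before it) are skipped.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : lines) {
            int eq = line.indexOf('=');
            if (eq <= 0) {
                continue; // not a field line
            }
            String key = line.substring(0, eq).trim();
            String value = line.substring(eq + 1).trim();
            fields.put(key, value);
        }
        return fields;
    }

    public static void main(String[] args) {
        // A few lines copied from the log above.
        List<String> log = List.of(
            "product_name = The Incident (CD)",
            "product_price = $14.01",
            "product_type = dvd");
        parse(log).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

Splitting on the first `=` only means a value that itself contains `=` (a URL with a query string, say) would stay in one piece.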
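As for the garbled characters in product_description, ignore them for now; they are probably caused by a bug in the previous version of the program. I got this working on the very day of the tenth anniversary of Macau's return, by the way. So first, best wishes to our motherland: may it prosper and grow ever stronger! That's all for today!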