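How do you crawl a web page whose content is loaded after the initial response? Put simply, the content we want to crawl is what Firefox shows under View Generated Source, not the View Source content that search engines use today. The open-source Gecko engine used by Firefox can produce exactly that. Since we are writing the crawler in Java, JRex already wraps Gecko for us and is easy to use.
I. Introduction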
"JRex" is a Java Browser Component with set of API's for Embedding Mozilla GECKO within a Java Application.
JRex provides a Java wrapper around the Firefox engine. Gecko itself ships as C++ DLLs; JRex mainly adds the JNI layer on top of them and exposes Java interfaces.
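II. Installation
1. Download jrex-bin-log-1.0b1_dom3.zip and jrex_gre.jar.
2. Change the extension of jrex_gre.jar to .rar, open the archive, and copy the innermost jrex_gre folder to C:\. Then copy jrex.dll from jrex-bin-log-1.0b1_dom3.zip into the C:\jrex_gre directory.
3. Run run.bat directly to see the Java browser built with JRex. It works quite well.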
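run.bat launches Java from a configured Java home, and the native DLLs have to be where the launcher expects them. As a quick sanity check before running it, a helper along the following lines can confirm that a given DLL sits under the bin directory of a Java home. This is only a sketch: DllCheck and hasDll are hypothetical names, not part of JRex, and the bin location is an assumption about where the DLL is looked up.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of JRex.
class DllCheck {

    // Returns true if <javaHome>/bin/<dllName> exists as a regular file.
    // Assumption: the DLL is expected directly under the bin directory.
    static boolean hasDll(Path javaHome, String dllName) {
        return Files.isRegularFile(javaHome.resolve("bin").resolve(dllName));
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: java DllCheck <java-home> <dll-name>");
            return;
        }
        Path home = Paths.get(args[0]);
        System.out.println(home + " contains bin/" + args[1] + ": "
                + hasDll(home, args[1]));
    }
}
```

Run it with a Java home and a DLL name as arguments. If it reports false, point run.bat at a different Java installation.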
Note: the JAVA_HOME used in run.bat should point to a JRE, not a JDK; otherwise jawt.dll will not be found. For example:
"C:\Program Files\Java\jre1.5.0_06/bin/java"
III. Programming
Goal: reproduce Firefox's View Generated Source.
The code is as follows:
import java.io.StringWriter;

import javax.swing.JFrame;
import javax.swing.JPanel;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.mozilla.jrex.JRexFactory;
import org.mozilla.jrex.event.progress.ProgressEvent;
import org.mozilla.jrex.navigation.WebNavigation;
import org.mozilla.jrex.navigation.WebNavigationConstants;
import org.mozilla.jrex.ui.JRexCanvas;
import org.mozilla.jrex.window.JRexWindowManager;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class Render implements org.mozilla.jrex.event.progress.ProgressListener {

    // Written by the browser's event thread and read by the waiting thread,
    // so it is volatile to make the update visible.
    volatile boolean done = false;

    public boolean parsePage(String url) throws Exception {
        System.setProperty("jrex.browser.usesetupflags", "true");
        System.setProperty("jrex.browser.allow.images", "false"); // do not load images
        System.setProperty("jrex.browser.allow.plugin", "false"); // do not load Flash

        // The JRexCanvas is the main browser component. The WebNavigation
        // object is used to access the DOM.
        JRexCanvas canvas = null;
        WebNavigation navigation = null;

        // Start up JRex/Gecko.
        JRexFactory.getInstance().startEngine();

        // Get a window manager and put the browser in a Swing frame.
        // Based on Dietrich Kappe's code.
        JRexWindowManager winManager = (JRexWindowManager) JRexFactory
                .getInstance().getImplInstance(JRexFactory.WINDOW_MANAGER);
        winManager.create(JRexWindowManager.SINGLE_WINDOW_MODE);
        JPanel panel = new JPanel();
        JFrame frame = new JFrame();
        frame.getContentPane().add(panel);
        winManager.init(panel);

        // Get the JRexCanvas, register Render to handle progress events so
        // we can determine when the page is loaded, and get the
        // WebNavigation object.
        canvas = (JRexCanvas) winManager.getBrowserForParent(panel);
        canvas.addProgressListener(this);
        navigation = canvas.getNavigator();

        // Load and process the page.
        navigation.loadURI(url, WebNavigationConstants.LOAD_FLAGS_NONE, null,
                null, null);

        // Swing magic: size the frame but keep it hidden.
        frame.setSize(640, 480);
        frame.setVisible(false);

        // Check every two seconds whether the DOM has loaded.
        while (!done) {
            Thread.sleep(2000);
        }

        // Get the DOM and print it as serialized HTML.
        Document doc = navigation.getDocument();
        Element ex = doc.getDocumentElement();
        System.out.println(xmlToString(ex));
        return true;
    }

    // Serializes a DOM node (and its subtree) to an HTML string.
    public static String xmlToString(Node node) throws Exception {
        Source source = new DOMSource(node);
        StringWriter stringWriter = new StringWriter();
        Result result = new StreamResult(stringWriter);
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer = factory.newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        transformer.transform(source, result);
        return stringWriter.getBuffer().toString();
    }

    /**
     * onStateChange is invoked several times when DOM loading is complete.
     * Set the done flag the first time.
     */
    public void onStateChange(ProgressEvent event) {
        if (!event.isLoadingDocument()) {
            if (done)
                return;
            done = true;
        }
    }

    public static void main(String[] args) throws Exception {
        Render p = new Render();
        p.parsePage("http://ilovelate.blog.163.com");
        System.exit(0);
    }

    // The remaining ProgressListener callbacks are not needed here.
    public void onLinkStatusChange(ProgressEvent event) {
    }

    public void onLocationChange(ProgressEvent event) {
    }

    public void onProgressChange(ProgressEvent event) {
    }

    public void onSecurityChange(ProgressEvent event) {
    }

    public void onStatusChange(ProgressEvent event) {
    }
}
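
The listing above waits for the page by polling a flag every two seconds, and it waits forever if the load never completes. One possible alternative is a one-shot latch with a timeout, sketched below. LoadSignal, markLoaded and awaitLoaded are hypothetical names, not part of JRex; the background thread in main only stands in for the browser's event thread, and the 200 ms delay and 5 second timeout are arbitrary values chosen for the demonstration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of JRex: a one-shot "page loaded" signal.
class LoadSignal {

    private final CountDownLatch latch = new CountDownLatch(1);

    // Intended to be called from onStateChange once the document is no
    // longer loading. Calls after the first one have no effect.
    void markLoaded() {
        latch.countDown();
    }

    // Blocks until markLoaded() has been called or the timeout elapses.
    // Returns true if the page loaded, false if the wait timed out.
    boolean awaitLoaded(long timeout, TimeUnit unit) throws InterruptedException {
        return latch.await(timeout, unit);
    }

    public static void main(String[] args) throws Exception {
        final LoadSignal signal = new LoadSignal();

        // Stand-in for the browser's event thread firing the callback.
        Thread fakeBrowser = new Thread(new Runnable() {
            public void run() {
                try {
                    Thread.sleep(200);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                signal.markLoaded();
            }
        });
        fakeBrowser.start();

        boolean loaded = signal.awaitLoaded(5, TimeUnit.SECONDS);
        System.out.println("loaded: " + loaded);
    }
}
```

In Render, the same idea would mean calling markLoaded() inside onStateChange and replacing the while loop with awaitLoaded(...), treating a false return as a failed page load. The waiting thread then wakes as soon as the event fires instead of up to two seconds later.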




