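Crawling AJAX Web Pages

This article shows how to set up a simple browser in a Java environment with JRex, and provides sample code that fetches a page's source and saves it to a file.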



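I.       Installation

Website: http://jrex.mozdev.org/

1.       Unzip jrex_gre.zip into the C:/jrex_gre directory.

2.       Then copy the files from jrex-bin-log-1.0b1_dom3.zip into the C:/jrex_gre directory.

3.       Run run.bat directly to see the Java browser implemented with JRex. It works quite well.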

Note: JAVA_HOME should point to a JRE, not a JDK; otherwise jawt.dll will not be found. For example:

  "C:/Program Files/Java/jre1.5.0_06/bin/java"


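II.       Programming

Goal: the equivalent of "View Generated Source" in Firefox, that is, the page source after scripts have run.

The code is as follows: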


import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.StringWriter;

import javax.swing.JFrame;

import javax.swing.JPanel;

import javax.xml.transform.OutputKeys;

import javax.xml.transform.Result;

import javax.xml.transform.Source;

import javax.xml.transform.Transformer;

import javax.xml.transform.TransformerFactory;

import javax.xml.transform.dom.DOMSource;

import javax.xml.transform.stream.StreamResult;

import org.mozilla.jrex.JRexFactory;

import org.mozilla.jrex.event.progress.ProgressEvent;

import org.mozilla.jrex.navigation.WebNavigation;

import org.mozilla.jrex.navigation.WebNavigationConstants;

import org.mozilla.jrex.ui.JRexCanvas;

import org.mozilla.jrex.window.JRexWindowManager;

import org.w3c.dom.Document;

import org.w3c.dom.Element;

import org.w3c.dom.Node;

public class Render implements org.mozilla.jrex.event.progress.ProgressListener {

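 // Written by the browser's event callback and read by the waiting thread,
 // so it is volatile to make the update visible across threads.
 volatile boolean done = false;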

 public boolean parsePage(String url) throws Exception {

  System.setProperty("jrex.browser.usesetupflags", "true");

  System.setProperty("jrex.browser.allow.images", "false"); // do not load images

  System.setProperty("jrex.browser.allow.plugin", "false"); // do not load Flash (plugins)
  
  // The JRexCanvas is the main browser component. The WebNavigation

  // object is used to access the DOM.

  JRexCanvas canvas = null;

  WebNavigation navigation = null;

  // Start up JRex/Gecko.

  JRexFactory.getInstance().startEngine();

  // Get a window manager and put the browser in a Swing frame.

  // Based on Dietrich Kappe's code.

  JRexWindowManager winManager = (JRexWindowManager) JRexFactory

  .getInstance().getImplInstance(JRexFactory.WINDOW_MANAGER);

  winManager.create(JRexWindowManager.SINGLE_WINDOW_MODE);

  JPanel panel = new JPanel();

  JFrame frame = new JFrame();

  frame.getContentPane().add(panel);

  winManager.init(panel);

  // Get the JRexCanvas, set Render to handle progress events so

  // we can determine when the page is loaded, and get the

  // WebNavigation object.

  canvas = (JRexCanvas) winManager.getBrowserForParent(panel);

  canvas.addProgressListener(this);

  navigation = canvas.getNavigator();

  // Load and process the page.

  navigation.loadURI(url, WebNavigationConstants.LOAD_FLAGS_NONE, null,

  null, null);

  // Swing magic.

  frame.setSize(640, 480);

  frame.setVisible(false);

  // Check if the DOM has loaded every two seconds.

  while (!done) {

   Thread.sleep(2000);

  }

  // Get the DOM and recurse on its nodes.

  Document doc = navigation.getDocument();

  Element ex = doc.getDocumentElement();

  
  // Serialize the DOM once, then save it and echo it to the console.
  String html = xmlToString(ex);

  File file = new File("d://youtube.html");
  OutputStreamWriter sw = new OutputStreamWriter(new FileOutputStream(file), "utf-8");
  try {
   sw.write(html);
  } finally {
   // Close the writer even if the write fails.
   sw.close();
  }

  System.out.println(html);

  return true;

 }

 public static String xmlToString(Node node) throws Exception {

  Source source = new DOMSource(node);

  StringWriter stringWriter = new StringWriter();

  Result result = new StreamResult(stringWriter);

  TransformerFactory factory = TransformerFactory.newInstance();

  Transformer transformer = factory.newTransformer();

  transformer.setOutputProperty(OutputKeys.METHOD, "html");

  transformer.transform(source, result);

  return stringWriter.getBuffer().toString();

 }

 /**

  * onStateChange is invoked several times when DOM loading is complete. Set

  * the done flag the first time.

  */

 public void onStateChange(ProgressEvent event) {

  if (!event.isLoadingDocument()) {

   if (done)

    return;

   done = true;

  }

 }

 public static void main(String[] args) throws Exception {

  
  //String url = "http://www.youtube.com/watch?v=XOHE2KsmdGg";
  //String url = "http://www.cnn.com";
  String url = "http://www.56.com/u42/v_MzY2NTYxNjc.html";
  //String url = "http://ilovelate.blog.163.com";
  
  Render p = new Render();

  p.parsePage(url);

  System.exit(0);

 }

 public void onLinkStatusChange(ProgressEvent event) {

 }

 public void onLocationChange(ProgressEvent event) {

 }

 public void onProgressChange(ProgressEvent event) {

 }

 public void onSecurityChange(ProgressEvent event) {

 }

 public void onStatusChange(ProgressEvent event) {

 }

}
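
The polling loop in parsePage sleeps in two-second steps and waits forever if the page never finishes loading. A possible alternative, shown here only as a minimal sketch, is to have onStateChange release a CountDownLatch and to wait on it with a timeout. LoadWaiter, markLoaded and awaitLoaded are hypothetical names that are not part of JRex, and the "browser" thread in main is simulated so the example runs with the JDK alone.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LoadWaiter {

    // Released once when the document has finished loading.
    private final CountDownLatch loaded = new CountDownLatch(1);

    // Assumption: this would be called from onStateChange when
    // event.isLoadingDocument() returns false. Counting down a latch
    // that is already at zero does nothing, so repeated calls are harmless.
    public void markLoaded() {
        loaded.countDown();
    }

    // Blocks until markLoaded() has been called or the timeout expires.
    // Returns true if loading completed in time, false on timeout.
    public boolean awaitLoaded(long timeout, TimeUnit unit) throws InterruptedException {
        return loaded.await(timeout, unit);
    }

    public static void main(String[] args) throws Exception {
        final LoadWaiter waiter = new LoadWaiter();

        // Simulated browser thread standing in for the real progress callback.
        Thread fakeBrowser = new Thread(() -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            waiter.markLoaded();
        });
        fakeBrowser.start();

        boolean ok = waiter.awaitLoaded(5, TimeUnit.SECONDS);
        System.out.println(ok ? "loaded" : "timed out");
    }
}
```

In Render, the done flag and the while loop would be replaced by one such latch, and parsePage could return false when the wait times out.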

Running this code requires the following VM arguments:
-Djrex.dom.enable=true
-Djrex.gre.path=c:/jrex_gre

Note: change the output file in File file = new File("d://youtube.html"); to a path that exists on your machine.
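
Since the output path is hard-coded, one option is to pass it in as a parameter. The following is a minimal sketch using only JDK classes: DomSaver, toHtml and save are hypothetical names, and the small document built in main is a stand-in for the one that navigation.getDocument() returns in the listing above.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class DomSaver {

    // Serializes a DOM node to a string with the "html" output method,
    // the same approach as xmlToString in the listing.
    public static String toHtml(Node node) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }

    // Serializes the node and writes it to the given file as UTF-8.
    public static void save(Node node, Path target) throws TransformerException, IOException {
        Files.write(target, toHtml(node).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        // Stand-in document; the real one would come from the browser.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element html = doc.createElement("html");
        Element body = doc.createElement("body");
        Element p = doc.createElement("p");
        p.setTextContent("hello");
        body.appendChild(p);
        html.appendChild(body);
        doc.appendChild(html);

        // Output path from the first argument, defaulting to page.html.
        Path target = Paths.get(args.length > 0 ? args[0] : "page.html");
        save(doc.getDocumentElement(), target);
        System.out.println("wrote " + target.toAbsolutePath());
    }
}
```

Passing the path on the command line avoids editing and recompiling the class for each page that is saved.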

Set the environment variables:
JAVA_HOME = C:/Java/jre1.5.0   (the JRE directory, not the JDK directory)
JREX_GRE_PATH=c:/jrex_gre
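
A wrong path here tends to show up only as a native-library error at startup, so a small check before starting the engine can save some time. This is a minimal sketch: JRexPreflight and checkGrePath are hypothetical names, and falling back from the jrex.gre.path system property to the JREX_GRE_PATH environment variable is an assumption made for this example, not documented JRex behavior.

```java
import java.io.File;

public class JRexPreflight {

    // Returns a description of the problem, or null if the path looks usable.
    public static String checkGrePath(String grePath) {
        if (grePath == null || grePath.trim().isEmpty()) {
            return "GRE path is not set";
        }
        File dir = new File(grePath);
        if (!dir.isDirectory()) {
            return "not a directory: " + grePath;
        }
        return null;
    }

    public static void main(String[] args) {
        // Assumption: prefer the -D system property, then the environment variable.
        String grePath = System.getProperty("jrex.gre.path", System.getenv("JREX_GRE_PATH"));
        String problem = checkGrePath(grePath);
        System.out.println(problem == null ? "GRE path OK: " + grePath : "Problem: " + problem);

        // Shows which Java installation is actually running the program.
        System.out.println("java.home = " + System.getProperty("java.home"));
    }
}
```

Run it with the same VM arguments as Render to confirm that the settings reach the JVM.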

For the exact installation directory layout, see install.jpg.
