On Crawlers and Anti-Crawlers

cschroc

Licensed under CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/cschroc/article/details/53471299

A few days ago I attended a talk on web crawlers. I am sharing my notes here, partly to deepen my own understanding.

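I. Building a crawler:

1). Fetch data with HTTP requests and parse it with regular expressions. PS: crawlers of this kind account for more than half of all crawlers.

2). Generate the request parameters with JavaScript and access AJAX sites.

3). Render the page in a browser and capture the rendered result.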

The first type of crawler: fetching data over HTTP.

Example: suppose we want to crawl some site A and retrieve all the resources under it:

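Step 1.

Use Wireshark to analyze the format of the request URLs that are sent when the site is accessed from a browser or an app.

If many of the outgoing requests share one format, and opening such a link returns valid data (commonly JSON or XML), then that URL is the one the client uses to fetch its data. Analyze it to work out the domain name and the meaning of each parameter.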

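For example: http://www.xx.yy/aaa/filter?start=0&size=20&userId=&category=21&City=%E4%B8%8A%E6%B5%B7

www.xx.yy: the domain name.

category: the category. (Each item in the page's menu bar may correspond to one category.)

City: the city name, here the percent-encoded UTF-8 form of 上海 (Shanghai). (A full list of city names can be found online.)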
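
To make the parameter analysis concrete, here is a minimal sketch that splits the example URL into decoded name/value pairs. The class name `QueryParams` and the method `parse` are made up for this illustration, and the `URLDecoder.decode(String, Charset)` overload assumes Java 10 or later.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the original article
public class QueryParams {

    /** Splits the query string of a URL into decoded name/value pairs, keeping their order. */
    public static Map<String, String> parse(String url) {
        Map<String, String> params = new LinkedHashMap<>();
        // getRawQuery() keeps the percent-encoding, so each value is decoded exactly once below
        String query = URI.create(url).getRawQuery();
        if (query == null || query.isEmpty()) {
            return params;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(URLDecoder.decode(name, StandardCharsets.UTF_8),
                    URLDecoder.decode(value, StandardCharsets.UTF_8));
        }
        return params;
    }

    public static void main(String[] args) {
        String url = "http://www.xx.yy/aaa/filter?start=0&size=20&userId=&category=21"
                + "&City=%E4%B8%8A%E6%B5%B7";
        // Prints one "name = value" line per parameter; City comes out as the decoded city name
        parse(url).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Empty parameters such as userId are kept with an empty value, which makes it easier to see which fields the client leaves blank.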

Step 2.

In a loop, combine the URL with dynamic parameter values (read from an XML or properties file), send the requests, and collect the data.
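
The loop described above might look roughly like the following sketch. It assumes that `start` is a record offset and `size` a page size, which the example URL suggests but does not confirm. The names `UrlBuilder`, `buildUrl`, `buildUrls` and `pagesPerQuery` are hypothetical, and the lists are hard-coded here instead of being read from a file. `URLEncoder.encode(String, Charset)` and `List.of` assume Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original article
public class UrlBuilder {
    // Endpoint taken from the example URL in Step 1
    private static final String BASE = "http://www.xx.yy/aaa/filter";

    /** Builds one request URL for a given offset, page size, category and city. */
    public static String buildUrl(int start, int size, int category, String city) {
        return BASE + "?start=" + start + "&size=" + size + "&userId="
                + "&category=" + category
                + "&City=" + URLEncoder.encode(city, StandardCharsets.UTF_8);
    }

    /** Enumerates every category/city/page combination (pagesPerQuery is an assumed limit). */
    public static List<String> buildUrls(List<Integer> categories, List<String> cities,
            int pagesPerQuery, int size) {
        List<String> urls = new ArrayList<>();
        for (int category : categories) {
            for (String city : cities) {
                for (int page = 0; page < pagesPerQuery; page++) {
                    // Assumption: "start" advances by one page size per request
                    urls.add(buildUrl(page * size, size, category, city));
                }
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // In practice these lists would be loaded from an XML or properties file
        List<String> cities = List.of("\u4e0a\u6d77"); // the city from the example URL
        buildUrls(List.of(21, 22), cities, 2, 20).forEach(System.out::println);
    }
}
```

Each generated URL would then be passed to the fetching code, so the crawl is driven entirely by the parameter lists.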

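Below is a small snippet that fetches images. Other resources, for example a JSON node that holds the URL of a subdirectory, can be matched with regular expressions in the same way.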

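	// Imports used by the snippets in this article
	import java.io.*;
	import java.net.*;
	import java.sql.*;
	import java.util.*;
	import java.util.regex.*;

	// Character encoding
	private static final String ECODING = "UTF-8";
	// Regex matching an <img> tag
	private static final String IMGURL_REG = "<img.*src\\s*=\\s*(.*?)[^>]*?>";// <img.*src=(.*?)[^>]*?>
	// Regex matching an http URL, up to a closing quote, ">" or whitespace
	private static final String IMGSRC_REG = "http:\"?(.*?)(\"|>|\\s+)";
	// Alternative regex matching the whole src attribute
	private static final String IMGSRC_REG_02 = "src\\s*=\\s*\"?(.*?)(\"|>|\\s+)";
        .........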
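	/**
	 * Fetch the HTML content of a page.
	 *
	 * @param url address of the page to fetch
	 * @return the response body, decoded with ECODING
	 * @throws Exception if the URL is malformed or the request fails
	 */
	public String getHTML(String url) throws Exception {
		URL uri = new URL(url);
		URLConnection connection = uri.openConnection();
		StringBuilder sb = new StringBuilder();
		// Decode through a Reader so that multi-byte characters split across
		// two reads stay intact; try-with-resources closes the stream
		try (Reader in = new InputStreamReader(connection.getInputStream(), ECODING)) {
			char[] buf = new char[1024];
			int length;
			while ((length = in.read(buf, 0, buf.length)) != -1) {
				// Append only the characters actually read, not the whole buffer
				sb.append(buf, 0, length);
			}
		}
		return sb.toString();
	}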


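	/**
	 * Extract the image URLs from a page.
	 *
	 * @param HTML page source to scan
	 * @return every URL matched by IMGSRC_REG (http: only), without its terminator
	 */
	public List<String> getImageUrl(String HTML) {
		Matcher matcher = Pattern.compile(IMGSRC_REG).matcher(HTML);
		List<String> listImgUrl = new ArrayList<String>();
		while (matcher.find()) {
			// Group 2 is the terminator (quote, ">" or whitespace);
			// keep the match only up to where it begins
			listImgUrl.add(HTML.substring(matcher.start(), matcher.start(2)));
		}
		return listImgUrl;
	}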
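
For the JSON case mentioned before the snippet, a regular expression can pull sub-URLs out of a response in much the same way. The sketch below is only an illustration: the field name `url`, the class `JsonUrlExtractor` and the sample response are all assumptions, and a real JSON parser would be more robust than a regex for nested or escaped content.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original article
public class JsonUrlExtractor {
    // Matches "url": "<value>" and captures the value; the field name "url" is an assumption
    private static final Pattern URL_FIELD = Pattern.compile("\"url\"\\s*:\\s*\"(.*?)\"");

    /** Returns the value of every "url" field found in the JSON text, in order of appearance. */
    public static List<String> extractUrls(String json) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL_FIELD.matcher(json);
        while (matcher.find()) {
            // Some serializers escape "/" as "\/", so undo that before using the URL
            urls.add(matcher.group(1).replace("\\/", "/"));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Made-up response body with two sub-URLs, one of them with escaped slashes
        String json = "{\"items\":[{\"name\":\"a\",\"url\":\"http://www.xx.yy/aaa/sub1\"},"
                + "{\"name\":\"b\",\"url\":\"http:\\/\\/www.xx.yy\\/aaa\\/sub2\"}]}";
        extractUrls(json).forEach(System.out::println);
    }
}
```

The extracted URLs can be queued and fetched in turn, which is how the crawler walks from a listing into its subdirectories.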

Step 3: save the data (to local disk, or to a database)

/**
 * Download images.
 *
 * @param listImgSrc image URLs extracted from the page
 */
public void Download(List<String> listImgSrc) {
	for (String imgSrc : listImgSrc) {
		try {
			imgSrc = imgSrc.replace(" ", "");
			// The file name is everything after the last "/"
			String imageName = imgSrc.substring(imgSrc.lastIndexOf("/") + 1);
			URL uri = new URL("http://www.xxx.yy/images/" + imageName);
			String localPath = "d:\\downfiles\\images\\" + imageName;
			File file = new File(localPath);
			if (!file.getParentFile().exists()) {
				file.getParentFile().mkdirs();
			}

			System.out.println("Starting download: " + imgSrc);
			// try-with-resources closes both streams even if the copy fails;
			// FileOutputStream creates the file if it does not exist yet
			try (InputStream in = uri.openStream();
					FileOutputStream fo = new FileOutputStream(file)) {
				byte[] buf = new byte[1024];
				int length;
				while ((length = in.read(buf, 0, buf.length)) != -1) {
					fo.write(buf, 0, length);
				}
			}
			System.out.println(imageName + " downloaded");
		} catch (Exception e) {
			// One failed image should not abort the rest of the batch
			System.out.println("Download failed: " + imgSrc + " (" + e.getMessage() + ")");
		}
	}
}

/* Write to the database */

private static Connection getConn() {
	// Legacy driver class name; Connector/J 8 and later use com.mysql.cj.jdbc.Driver
	String driver = "com.mysql.jdbc.Driver";
	String url = "jdbc:mysql://localhost:3306/grabdb";
	String username = "root";
	String password = "123456";
	Connection conn = null;
	try {
		Class.forName(driver); // load the JDBC driver through the class loader
		conn = DriverManager.getConnection(url, username, password);
	} catch (ClassNotFoundException e) {
		e.printStackTrace();
	} catch (SQLException e) {
		e.printStackTrace();
	}
	return conn; // null if the driver is missing or the connection failed
}

public int insert(FileInfo fileInfo) {
	int i = 0;
	String sql = "insert into fileinfo (fileType,url,fileName,localPath,currentCity,relationJsonName) values(?,?,?,?,?,?)";
	// try-with-resources closes the statement and the connection even on failure
	try (Connection conn = getConn();
			PreparedStatement pstmt = conn.prepareStatement(sql)) {
		pstmt.setString(1, fileInfo.getFileType());
		pstmt.setString(2, fileInfo.getUrl());
		pstmt.setString(3, fileInfo.getFileName());
		pstmt.setString(4, fileInfo.getLocalPath());
		pstmt.setString(5, fileInfo.getCurrentCity());
		pstmt.setString(6, fileInfo.getJsonName());

		i = pstmt.executeUpdate();
		// Report success only after the statement has actually executed
		System.out.println("Row inserted successfully!");
	} catch (SQLException e) {
		e.printStackTrace();
	}
	return i; // number of rows inserted, 0 on failure
}
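
The `FileInfo` class used by `insert` is not shown in the article. A minimal sketch that satisfies the six getters called above could look like this; the field meanings and the constructor are assumptions inferred from the column names, not the author's actual class.

```java
// Hypothetical reconstruction: only the getter names come from the insert method above
public class FileInfo {
    private final String fileType;     // kind of resource, for example an image or a JSON file
    private final String url;          // address the resource was fetched from
    private final String fileName;     // name of the saved file
    private final String localPath;    // where the file was stored on disk
    private final String currentCity;  // value of the City parameter used for the request
    private final String jsonName;     // stored in the relationJsonName column

    public FileInfo(String fileType, String url, String fileName,
            String localPath, String currentCity, String jsonName) {
        this.fileType = fileType;
        this.url = url;
        this.fileName = fileName;
        this.localPath = localPath;
        this.currentCity = currentCity;
        this.jsonName = jsonName;
    }

    public String getFileType() { return fileType; }
    public String getUrl() { return url; }
    public String getFileName() { return fileName; }
    public String getLocalPath() { return localPath; }
    public String getCurrentCity() { return currentCity; }
    public String getJsonName() { return jsonName; }
}
```

Keeping the object immutable is a design choice made for this sketch; a version with setters would work just as well with the `insert` method.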

The second type of crawler:

To be continued......