Implementing spider-like functionality in PL/SQL

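This article shows how to fetch web page content with the UTL_HTTP package that Oracle provides, and walks through a simple crawling example that collects every link on a given page.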

--------------------------------------------------------------------------
-----------------------------Original work by Cryking------------------------------
-----------------------Please credit the source when reposting. Thanks!------------------------

First, let's look at a simple example that uses the utl_http package to fetch the content of a web page:

Note: a non-DBA user must first be granted the privilege to execute this package.
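The grant mentioned in the note could look roughly like the sketch below. The schema name SPIDER_USER is hypothetical, and the ACL step is an assumption about the environment: it applies only if the database is 11g or later, where outbound network access is additionally controlled by access control lists (the APPEND_HOST_ACE form shown here is the 12c-and-later one).

```sql
-- Run as a privileged user. SPIDER_USER is a hypothetical schema name.
GRANT EXECUTE ON UTL_HTTP TO spider_user;
-- UTL_INADDR is used by the crawler further down to resolve hosts to IPs.
GRANT EXECUTE ON UTL_INADDR TO spider_user;

-- Assumption: Oracle 12c or later. Allow the user to connect to and
-- resolve any host. Narrow the host value in a real deployment.
BEGIN
  DBMS_NETWORK_ACL_ADMIN.APPEND_HOST_ACE(
    host => '*',
    ace  => xs$ace_type(
              privilege_list => xs$name_list('connect', 'resolve'),
              principal_name => 'SPIDER_USER',
              principal_type => xs_acl.ptype_db));
END;
/
```

With the privileges in place, the fetch example is as follows.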

DECLARE
  req   utl_http.req;
  resp  utl_http.resp;
  value VARCHAR2(1024);
BEGIN
  req := utl_http.begin_request('http://blog.youkuaiyun.com/edcvf3');
  utl_http.set_header(req, 'User-Agent', 'Mozilla/4.0');
  resp := utl_http.get_response(req);
  LOOP
    -- Reads up to 1024 characters per call; read_line can be used instead.
    utl_http.read_text(resp, value);
    dbms_output.put_line('--------------');
    dbms_output.put_line(value);
  END LOOP;
  utl_http.end_response(resp);
EXCEPTION
  WHEN utl_http.end_of_body THEN
    -- Raised once the whole body has been read; this is the normal exit.
    utl_http.end_response(resp);
  WHEN OTHERS THEN
    dbms_output.put_line(utl_http.get_detailed_sqlerrm);
    -- The response must be closed. Otherwise an error is raised, and later
    -- requests fail with a "too many open connections" message.
    utl_http.end_response(resp);
END;

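As shown above, the code is fairly simple: sending the request and returning the result are already implemented by the package's procedures and functions, so you only need to know how to call them.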

The output looks like this:

--------------
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>流云追风 - 博客频道 - youkuaiyun.com</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="description" content="" />
<script src="http://static.blog.youkuaiyun.com/scripts/jquery.js" type="text/javascript"></script>
<script type="text/javascript" src="http://static.blog.youkuaiyun.com/scripts/ad.js?v=1.1"></script>
<link rel="Stylesheet" type="text/css" href="http://static.blog.youkuaiyun.com/skin/default/css/style.css?v=1.1" />
<link id="RSSLink" title="RSS" type="application/rss+xml" rel="alternate" href="/edcvf3/rss/list" />
<link rel="shortcut icon" href="/favicon.ico" />
<link type="text/css" rel="stylesheet" href="http://static.blog.youkuaiyun.com/scripts/SyntaxHighlighter/styles/blue_green.css" />
</head>
<body>
<script src="http://csdnimg.cn/pubnav/js/pub_topnav_2011.js"type="text/javascript"></script>


<di
--------------
v id="container">
<div id="header">
<div class="header">
<div id="blog_title">
<h1><a href="/edcvf3">流云追风</a></h1>
<h2>追寻编程之道</h2>
<div class="clear"></div>
</div>
<div class="clear"></div>
</div>
</div>
<div id="navigator">
<div class="navigator_bg"></div>
<div class="navigator">

... (too long; the rest is omitted)
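The output above shows a tag split across two chunks (`<di` at the end of one, `v id="container">` at the start of the next), because read_text returns fixed-size pieces regardless of line breaks. A minimal sketch of the read_line variant mentioned in the code comment follows. It assumes the same URL, and that every line of the page fits in the 32767-character buffer.

```sql
DECLARE
  req  utl_http.req;
  resp utl_http.resp;
  line VARCHAR2(32767);
BEGIN
  req  := utl_http.begin_request('http://blog.youkuaiyun.com/edcvf3');
  utl_http.set_header(req, 'User-Agent', 'Mozilla/4.0');
  resp := utl_http.get_response(req);
  BEGIN
    LOOP
      -- Third argument TRUE strips the trailing CR/LF from each line.
      utl_http.read_line(resp, line, TRUE);
      dbms_output.put_line(line);
    END LOOP;
  EXCEPTION
    WHEN utl_http.end_of_body THEN
      NULL; -- normal end of the response body
  END;
  utl_http.end_response(resp);
END;
/
```

Reading line by line keeps each tag intact as long as the page does not break a tag across lines itself.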

Since fetching page content is this easy, building a spider on top of it is not hard either.

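Note: this example does not crawl the actual content of the pages. It only saves every site address found on one page, together with the corresponding IP, into a table.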
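The table has to exist before the crawler runs. The original post does not show its definition, so the following is only a guess: the table and column names are taken from the INSERT statement in the crawler, while the data types and sizes are assumptions.

```sql
-- Hypothetical definition of IP_URL. Column names come from the crawler's
-- INSERT statement; types and lengths are assumed.
CREATE TABLE ip_url (
  ip         VARCHAR2(64),   -- address returned by utl_inaddr.get_host_address
  urladdress VARCHAR2(2000), -- host name with the scheme stripped
  indate     DATE            -- time the row was inserted
);
```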

The code is as follows:

DECLARE
  v_req   UTL_HTTP.REQ;
  v_resp  UTL_HTTP.RESP;
  v_value VARCHAR2(2500);
  v_count NUMBER := 1;
  v_url   VARCHAR2(2000);
BEGIN
  -- Crawl hao123 because it links to a large number of sites.
  v_req  := UTL_HTTP.BEGIN_REQUEST('http://www.hao123.com');
  v_resp := UTL_HTTP.GET_RESPONSE(v_req);
  LOOP
    UTL_HTTP.READ_TEXT(v_resp, v_value);
    IF INSTR(UPPER(v_value), 'HREF') > 0 THEN
      LOOP
        IF INSTR(UPPER(v_value), 'HTTP') > 0 THEN
          -- Match a site address, case-insensitively. The suffixes are grouped
          -- so that a bare "cn", "org" or "net" cannot match on its own.
          v_url := REGEXP_SUBSTR(v_value, 'http[0-9a-zA-Z/:.]+(com|cn|org|net)', 1, 1, 'i');
          IF v_url IS NULL THEN
            EXIT;
          END IF;
          -- Strip the scheme; the address is upper-cased as a side effect.
          IF INSTR(UPPER(v_url), 'HTTPS') > 0 THEN
            v_url := REPLACE(UPPER(v_url), 'HTTPS://', '');
          ELSE
            v_url := REPLACE(UPPER(v_url), 'HTTP://', '');
          END IF;
          BEGIN
            DBMS_OUTPUT.PUT_LINE(v_url); -- print each address as it is found
            DBMS_OUTPUT.PUT_LINE('--------------');
            -- Save the site. Only the address and IP are stored here; with a
            -- little more work the page content could be stored as well.
            INSERT INTO ip_url (ip, urladdress, indate)
            VALUES (UTL_INADDR.GET_HOST_ADDRESS(v_url), v_url, SYSDATE);
          EXCEPTION
            WHEN OTHERS THEN
              NULL; -- skip addresses that cannot be resolved or inserted
          END;
          IF REPLACE(v_value, ' ', '') IS NULL THEN
            EXIT;
          END IF;
          -- Remove the address just handled so the next pass finds the next one.
          v_value := REPLACE(UPPER(v_value), v_url, '');
        ELSE
          EXIT;
        END IF;
      END LOOP;
    END IF;
    EXIT WHEN v_count >= 2000; -- stop after 2000 chunks
    v_count := v_count + 1;
  END LOOP;
  UTL_HTTP.END_RESPONSE(v_resp);
  -- Note: the inserted rows are not committed here.
EXCEPTION
  WHEN UTL_HTTP.END_OF_BODY THEN
    UTL_HTTP.END_RESPONSE(v_resp);
  WHEN OTHERS THEN
    DBMS_OUTPUT.PUT_LINE(v_value);
    DBMS_OUTPUT.PUT_LINE(UTL_HTTP.GET_DETAILED_SQLERRM);
    UTL_HTTP.END_RESPONSE(v_resp);
END;
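A note on the URL pattern: alternation binds more loosely than concatenation in a regular expression, so the list of suffixes needs to be wrapped in parentheses. The query below is a small illustration using a made-up input string. It compares a pattern with an ungrouped suffix list against the grouped form.

```sql
-- The input string is invented for illustration only.
SELECT REGEXP_SUBSTR('see cn http://www.hao123.com/',
                     'http[0-9a-zA-Z/:.]+com|cn|org|net', 1, 1, 'i') AS ungrouped,
       REGEXP_SUBSTR('see cn http://www.hao123.com/',
                     'http[0-9a-zA-Z/:.]+(com|cn|org|net)', 1, 1, 'i') AS grouped
  FROM dual;
```

The ungrouped pattern reads as "http...com, or cn, or org, or net", so it returns the stray `cn` that appears first in the string. The grouped pattern returns `http://www.hao123.com`.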

The printed output looks like this:

WWW.HAO123.COM
TV.HAO123.COM
MOVIE.HAO123.COM
MUSIC.HAO123.COM
TUAN.BAIDU.COM
XYX.HAO123.COM
FEEDBACK.HAO123.COM
S0.HAO123IMG.COM
WWW.HAO123.COM
WWW.HAO123.COM
PAN.BAIDU.COM
S0.HAO123IMG.COM
HI.BAIDU.COM
S1.HAO123IMG.COM
WWW.HAO123.COM
HI.BAIDU.COM
REG.163.COM
WWW.BAIDU.COM
WWW.HAO123.COM
MUSIC.BAIDU.COM
VIDEO.BAIDU.COM
IMAGE.BAIDU.COM
TIEBA.BAIDU.COM
ZHIDAO.BAIDU.COM
NEWS.BAIDU.COM

... (there are too many to list them all)
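The list above contains many repeats (WWW.HAO123.COM alone appears several times), since every occurrence on the page is printed and inserted. One way to view the distinct sites afterwards is an aggregate query over the ip_url table. This is only a sketch and assumes the rows were inserted and committed.

```sql
-- One row per distinct address, most frequently linked first.
SELECT urladdress,
       MIN(ip)  AS ip,
       COUNT(*) AS hits
  FROM ip_url
 GROUP BY urladdress
 ORDER BY hits DESC;
```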


Attached is a screenshot of the page titles of some sites I have already crawled:



Next, I plan to add scanning of sites that listen on ports other than the default (80)...


