I. First Steps with Web Scraping
1. urllib
Fetching HTML with urllib
from urllib.request import urlopen
html = urlopen('http://pythonscraping.com/pages/page1.html')
print(html.read())
'''
b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
'''
2. Beautiful Soup
Parsing HTML with bs4
from bs4 import BeautifulSoup
bs = BeautifulSoup(html.read(), 'html.parser')
# equivalently
# bs = BeautifulSoup(html, 'html.parser')
print(bs.h1)  # returns the first h1 tag on the page
# equivalently
# print(bs.html.body.h1)
# print(bs.html.h1)
# print(bs.body.h1)
'''
<h1>An Interesting Title</h1>
'''
The second argument to BeautifulSoup is the HTML parser. html.parser ships with the standard library; lxml and html5lib can also be used, but they must be installed separately (a small sketch follows below).
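As a minimal sketch (the broken_html string here is made up for illustration), switching parsers only means changing that second argument, e.g. after pip install lxml html5lib:
from bs4 import BeautifulSoup
broken_html = '<html><body><h1>Unclosed heading<p>and an unclosed paragraph'
print(BeautifulSoup(broken_html, 'html.parser').h1)  # built into the standard library
print(BeautifulSoup(broken_html, 'lxml').h1)         # requires: pip install lxml
print(BeautifulSoup(broken_html, 'html5lib').h1)     # requires: pip install html5lib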
3. Exception Handling
Several kinds of exceptions typically need to be handled:
- HTTPError: the requested page or file was not found on the server
- URLError: the server itself could not be reached
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
try:
    html = urlopen("https://gathub.com")
except HTTPError as e:
    print("The server returned an HTTP error")
except URLError as e:
    print("The server could not be found!")
else:
    print(html.read())
'''
The server could not be found!
'''
- AttributeError: besides problems on the server side, there are also bugs in your own code, e.g. accessing a DOM node that does not exist (see the short sketch below)
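A minimal sketch of how that last exception arises: find() returns None for a tag that is not on the page, and calling a method on None raises AttributeError.
from urllib.request import urlopen
from bs4 import BeautifulSoup
bs = BeautifulSoup(urlopen('http://pythonscraping.com/pages/page1.html'), 'html.parser')
try:
    print(bs.find('nonExistentTag').get_text())  # find() returns None here
except AttributeError:
    print('Tag was not found')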
4. Putting It All Together
A complete request with error handling:
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "lxml")
        title = bsObj.body.h1
    except AttributeError as e:
        return None
    return title
title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)
II. HTML Parsing
Do not write code like the following:
bs.find_all('table')[4].find_all('tr')[2].find('td').find_all('div')[1].find('a')
Relying so heavily on the positional structure of the DOM makes a scraper break as soon as the site changes even slightly; a more robust sketch follows below.
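As a rough sketch (the class name gift-link is hypothetical, not taken from any real page), anchoring on a distinguishing attribute of the target element survives layout changes much better than a positional chain:
# select the element by a stable attribute instead of counting tables, rows and divs
link = bs.find('a', {'class': 'gift-link'})  # 'gift-link' is a made-up class name
if link is not None:
    print(link['href'])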
1. bs.findAll(), bs.find()
Searching by tag attributes
bs.find_all(tagName, tagAttributes)  # fetch tags of a given type that carry the given attributes
'''
<span class="red">Heavens! what a virulent attack!</span> replied
<span class="green">the prince</span>, not in the least disconcerted
'''
nameList = bs.findAll('span', {'class': 'green'})
for name in nameList:
    print(name.get_text())
'''
Anna
Pavlovna Scherer
Empress Marya
Fedorovna
...
'''
name.get_text()  # strips away the markup and returns only the text content
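A tiny self-contained illustration of the difference (the span here is constructed inline rather than fetched):
from bs4 import BeautifulSoup
tag = BeautifulSoup('<span class="green">the prince</span>', 'html.parser').span
print(tag)             # <span class="green">the prince</span>
print(tag.get_text())  # the prince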
Matching several attribute values at once
allText = bs.find_all('span', {'class':{'green', 'red'}})
Finding all heading tags
titles = bs.find_all(['h1', 'h2','h3','h4','h5','h6'])
print([title for title in titles])
'''
[<h1>War and Peace</h1>, <h2>Chapter 1</h2>]
'''
Searching by text content
nameList = bs.find_all(text='the prince')
print(len(nameList))
Lambda expressions
bs.find_all(lambda tag: len(tag.attrs) == 2)  # all tags that carry exactly two attributes
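For instance, a lambda can roughly reproduce the text query shown above; a small sketch against the same page:
# roughly equivalent to bs.find_all(text='the prince'), but returns whole tags
bs.find_all(lambda tag: tag.get_text() == 'the prince')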
2. Navigating the DOM Tree
Using a tag's parent, children, and siblings
children
for child in bs.find('table', {'id': 'giftList'}).children:
    print(child)
siblings
next_siblings and previous_siblings return generators
next_sibling and previous_sibling return a single tag
for sibling in bs.find('table', {'id': 'giftList'}).tr.next_siblings:
    print(sibling)
parent
bs.find('img', {'src': '../img/gifts/img1.jpg'}).parent.previous_sibling.get_text()
Using regular expressions
Admittedly, there is nothing sophisticated about this particular regular expression.
import re
images = bs.find_all('img', {'src': re.compile(r'\.\./img/gifts/img.*\.jpg')})
for image in images:
    print(image['src'])
'''
../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg
'''
III. Building a Real Web Crawler
Random walk
A random walk starts from a given page, picks one of its links at random, jumps to that page, and then repeats the process.
import requests
from bs4 import BeautifulSoup
import datetime
import random
import re
from urllib.parse import unquote
random.seed(datetime.datetime.now())
headers = {
'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
def getLinks(articleUrl):
    html = requests.get("http://wiki.hk.wjbk.site/baike-{}".format(articleUrl), headers=headers).text
    bs = BeautifulSoup(html, 'html.parser')
    return bs.find('div', {'id':'content'}).find_all('a', href=re.compile('^(https:).*(baike-)((?!:).)*$'))
links = getLinks('卫生')
while len(links) > 0:
    newArticle = links[random.randint(0, len(links)-1)].attrs['href']
    print(unquote(newArticle))
    keywords = newArticle.split('/')[-1].split('-')[-1]
    links = getLinks(keywords)
'''
Output:
https://wiki.hk.wjbk.site/baike-卫生
https://wiki.hk.wjbk.site/baike-Anova
https://wiki.hk.wjbk.site/baike-趋势图
https://wiki.hk.wjbk.site/baike-統計學
https://wiki.hk.wjbk.site/baike-調和平均數
...
'''
A few points in the code above deserve explanation:
unquote
decodes the percent-encoded Chinese in the URL. This is mainly for human readability; otherwise you would see strings like %E8%86%9C%E8 instead of readable characters.
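A quick illustration (the URL below is the percent-encoded form of the starting page 卫生):
from urllib.parse import unquote
print(unquote('https://wiki.hk.wjbk.site/baike-%E5%8D%AB%E7%94%9F'))
# https://wiki.hk.wjbk.site/baike-卫生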
headers
Without the headers, the crawler is redirected by the site and falls into an endless loop, presumably one of the site's anti-scraping measures; see the probe sketched below.
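One way to observe this, assuming the mirror still behaves that way, is to compare the responses with and without the User-Agent header while redirects are disabled:
import requests
url = 'http://wiki.hk.wjbk.site/baike-%E5%8D%AB%E7%94%9F'  # baike-卫生
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
bare = requests.get(url, allow_redirects=False)
spoofed = requests.get(url, headers=headers, allow_redirects=False)
print(bare.status_code, bare.headers.get('Location'))  # likely a 30x pointing elsewhere
print(spoofed.status_code)                             # likely 200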
IV. Storing Data in MySQL
Common SQL
Some commonly used SQL statements:
CREATE DATABASE scraping; # create the database
USE scraping; # switch to it
CREATE TABLE pages (
    id BIGINT(7) NOT NULL AUTO_INCREMENT,
    title VARCHAR(200),
    content VARCHAR(10000),
    created TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY(id)); # create the table
DESCRIBE pages; # describe the table
INSERT INTO pages (title, content) VALUES ("Test page title",
    "This is some test page content. It can be up to 10,000 characters long."); # insert a row
SELECT * FROM pages WHERE id = 1; # query
DELETE FROM pages WHERE id = 1; # delete a row
DROP DATABASE scraping; # drop the database
Running all of the statements above in one batch gives:
mysql> CREATE DATABASE scraping;USE scraping;CREATE TABLE pages (id BIGINT(7) NOT NULL AUTO_INCREMENT,title VARCHAR(200), content VARCHAR(10000),created TIMESTAMP DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY(id));DESCRIBE pages;INSERT INTO pages (title, content) VALUES ("Test page title","This is some test page content. It can be up to 10,000 characters long.");SELECT * FROM pages WHERE id = 1;DELETE FROM pages WHERE id = 1;DROP DATABASE scraping;
Query OK, 1 row affected (0.01 sec)
Database changed
Query OK, 0 rows affected, 1 warning (0.02 sec)
+---------+----------------+------+-----+-------------------+-------------------+
| Field | Type | Null | Key | Default | Extra |
+---------+----------------+------+-----+-------------------+-------------------+
| id | bigint(7) | NO | PRI | NULL | auto_increment |
| title | varchar(200) | YES | | NULL | |
| content | varchar(10000) | YES | | NULL | |
| created | timestamp | YES | | CURRENT_TIMESTAMP | DEFAULT_GENERATED |
+---------+----------------+------+-----+-------------------+-------------------+
4 rows in set (0.00 sec)
Query OK, 1 row affected (0.00 sec)
+----+-----------------+-------------------------------------------------------------------------+---------------------+
| id | title | content | created |
+----+-----------------+-------------------------------------------------------------------------+---------------------+
| 1 | Test page title | This is some test page content. It can be up to 10,000 characters long. | 2020-01-01 16:43:23 |
+----+-----------------+-------------------------------------------------------------------------+---------------------+
1 row in set (0.00 sec)
Query OK, 1 row affected (0.01 sec)
Query OK, 1 row affected (0.01 sec)
mysql>
Using pymysql to operate the database
from bs4 import BeautifulSoup
import datetime
import random
import pymysql
import re
import requests
from urllib.parse import unquote
headers = {
'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
conn = pymysql.connect(host='127.0.0.1',
user='root',
passwd='password',
db='mysql',
charset='utf8')
cur = conn.cursor()
cur.execute('USE scraping')
random.seed(datetime.datetime.now())
def store(title, content):
    # placeholders must not be wrapped in quotes; pymysql escapes and quotes the values itself
    cur.execute('INSERT INTO pages (title, content) VALUES (%s, %s)', (title, content))
    cur.connection.commit()
def getLinks(articleUrl):
    html = requests.get("http://wiki.hk.wjbk.site/baike-{}".format(articleUrl), headers=headers).text
    bs = BeautifulSoup(html, 'html.parser')
    title = bs.find('h1').get_text()
    p = bs.find('div', {'id':'mw-content-text'}).find('p')
    content = ''.join([text.strip() for text in p.find_all(text=True)
                       if text.parent.name not in ['script', 'div']])  # drop text inside script tags and nested divs
    content = re.sub(r'\[[0-9]+\]', '', content)  # strip bracketed footnote markers such as [1]
    store(title, content)
    return bs.find('div', {'id':'content'}).find_all('a', href=re.compile('^(https:).*(baike-)((?!:).)*$'))
links = getLinks('柯南')
try:
    while len(links) > 0:
        newArticle = links[random.randint(0, len(links)-1)].attrs['href']
        print(unquote(newArticle))
        keywords = newArticle.split('/')[-1].split('-')[-1]
        links = getLinks(keywords)
finally:
    cur.close()
    conn.close()
Results:
mysql> delete from pages;
Query OK, 49 rows affected (0.01 sec)
mysql> select * from pages;
+----+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+
| id | title | content | created |
+----+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+
| 83 | '柯南' | '柯南,為塞爾特語名字,可能是指:' | 2020-01-01 17:48:06 |
| 84 | '名偵探柯南 (電視劇)' | '' | 2020-01-01 17:48:08 |
| 85 | '亂馬?' | '《乱马?》(日语:らんま1/2)是日本漫畫家高桥留美子的戀愛喜劇漫画,及後來陸續改編成動畫、電子遊戲、電視劇等衍生作品。 从1987年36號到1996年12號在小學館《週刊少年Sunday》连载,单行本全38冊。' | 2020-01-01 17:48:13 |
| 86 | '別讓我成為潑婦' | '《別讓我成為潑婦》(日语:じゃじゃ馬にさせないで)是日本女歌手西尾悅子(日语:西尾悦子)的歌曲。1989年4月25日由環球唱片(舊名Kitty Record)發行。' | 2020-01-01 17:48:17 |
| 87 | '无线电' | '無線電,又稱无线电波、射頻電波、電波,或射頻,是指在自由空間(包括空氣和真空)傳播的電磁波,在電磁波譜上,其波長長於 紅外線光(IR)。頻率範圍為300 GHz以下,其對應的波長範圍為1毫米以上。就像其他電磁波一樣,無線電波以光速前進。經由閃電或天文物體,可以產生自然的無線電 波。由人工產生的無線電波,被應用在無線通訊、廣播、雷達、通訊衛星、導航系統、電腦網路等應用上。' | 2020-01-01 17:48:19 |
| 88 | '国际广播' | '国际广播,又称对外广播,是指向非本国的听众进行的广播,有多種目的類型,例如新闻传播、文化交流、也有的是政治宣传。其以 電台廣播為大宗,但也有電視廣播。在大部分情况下,一般通过短波波段进行,对邻国有时也使用中波波段进行广播。同时还使用卫星和互联网进行广播。' | 2020-01-01 17:48:22 |
| 89 | '缅甸广播电视台' | '' | 2020-01-01 17:48:24 |
| 90 | '泰国公共电视台' | '' | 2020-01-01 17:48:27 |
| 91 | '達新·欽那瓦' | '塔克辛·钦那瓦(泰语:?????? ???????,皇家转写:Thaksin Chinnawat泰语发音:[t?ák.sǐn t??īn.nā.wát];1949年7月26日-),漢名丘達新,生于泰國北部清邁府,前泰國首相和泰愛泰黨創立人,也為前泰國皇家警察中校和商人。塔克辛的妹妹英叻·钦那瓦亦為前總理,兩人屬於第四代泰國華 人,祖籍广东潮州府丰顺县,客家人后裔。' | 2020-01-01 17:48:29 |
| 92 | '清邁府' | '清邁府(泰语:????????????????,皇家转写:Changwat Chiang Mai,泰语发音:[t??ā?.wàt t???īa?.màj];蘭納語:?????????,.mw-parser-output .IPA{font-family:"Charis SIL","Doulos SIL","Linux Libertine","Segoe UI","Lucida Sans Unicode","Code2000","Gentium","Gentium Alternative","TITUS Cyberbit Basic","Arial Unicode MS","IPAPANNEW","Chrysanthi Unicode","GentiumAlt","Bitstream Vera","Bitstream Cyberbit","Hiragino Kaku Gothic Pro","Lucida Grande",sans-serif;text-decoration:none!important}.mw-parser-output .IPA a:link,.mw-parser-output .IPA a:visited{text-decoration:none!important}t?ia?.màj),泰國北部邊陲的一個府,北面與緬甸接壤。另外,清邁府與另外5個府為鄰:清萊府(東北)、南邦府(東)、南奔府(東南)、達府(南)及夜豐頌府(西)。面積為20,107平方公里。府治為清迈市。' | 2020-01-01 17:48:32 |
| 93 | '西班牙' | '坐标:40°27′49″N3°44′57″W? / ?40.46366700000001°N 3.74922°W? /40.46366700000001; -3.74922' | 2020-01-01 17:48:35 |
| 94 | '马德里' | '马德里(西班牙語:Madrid;/m??dr?d/;西班牙語:.mw-parser-output .IPA{font-family:"Charis SIL","Doulos SIL","Linux Libertine","Segoe UI","Lucida Sans Unicode","Code2000","Gentium","Gentium Alternative","TITUS Cyberbit Basic","Arial Unicode MS","IPAPANNEW","Chrysanthi Unicode","GentiumAlt","Bitstream Vera","Bitstream Cyberbit","Hiragino Kaku Gothic Pro","Lucida Grande",sans-serif;text-decoration:none!important}.mw-parser-output .IPA a:link,.mw-parser-output .IPA a:visited{text-decoration:none!important}[ma???i?])是西班牙首都及最大都市,也是马德里自治区首府,其位置處於西班牙國土中部,曼薩納雷斯河貫穿市區。市內人口約340萬,都会区人口則約627.1萬(2010年),均佔西班牙首位。其建城於9世紀,是在摩尔人边贸站「马格立特」旧址上发展起来的城市;1561年,西班牙国王腓力二世将首都从托莱多迁入於此[註 1],由于其特殊的地位而得到迅速的发展,成為往後西班牙殖民帝國的運籌 中心,現今則與巴塞罗那並列為西班牙的兩大對外文化窗口。' | 2020-01-01 17:48:39 |
| 95 | '卑爾根' | '卑爾根(挪威语:Bergen聆听帮助·信息)是挪威第二大城市。根據政府的統計,直至2006年7月1日,卑爾根市區的人口有243,219人,如果連同郊區和周邊區域的話,則有369,099人。整個城市共分為八個區域:Arna、Bergenhus、Fana、Fyllingsdalen、Laksev?g、Ytrebygda、?rstad和?sane。' | 2020-01-01 17:48:44 |
| 96 | '德国' | '–歐洲(綠色及深灰色)–歐盟(綠色)? —? [圖例放大]' | 2020-01-01 17:48:48 |
| 97 | '國家和地區頂級域' | '國家和地區頂級域名(Country code top-level domain,英語:ccTLD),简称国家顶级域,又譯國碼域名、頂級國碼域名、國碼頂 級網域名稱,或頂級國碼網域名稱,是用兩字母的國家或地區名縮寫代稱的頂級域,其域名的指定及分配,政治因素考量凌駕在技術和商業因素之上。' | 2020-01-01 17:48:49 |
| 98 | '.hu' | '.hu為匈牙利國家及地區頂級域(ccTLD)的域名。' | 2020-01-01 17:48:50 |
+----+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+
16 rows in set (0.00 sec)