Notes on Writing Web Crawlers in Python

License: CC 4.0 BY-SA

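This article covers the basic operations of web crawling: using dedicated tools to identify the technologies a website is built with, looking up the website's owner, downloading web pages, parsing robots.txt, and parsing pages with regular expressions and BeautifulSoup.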

1. Identifying the technologies a website uses

pip install builtwith
import builtwith
builtwith.parse('http://www.youkuaiyun.com')

2. Finding the website's owner

pip install python-whois
import whois
print whois.whois('www.youkuaiyun.com')

3. Downloading a web page

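import urllib2  # Python 2 only; Python 3 splits this into urllib.request and urllib.error

def download(url):
    # Fetch the page and return its HTML, or None if the request fails.
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
    return html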

download('http://www.youkuaiyun.com')

For the full list of HTTP response status codes, including the error codes, see: https://tools.ietf.org/html/rfc7231#section-6
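
The download function above uses the Python 2 urllib2 module. As a minimal sketch of the same idea in Python 3, the version below also retries when the server answers with a 5xx status code, since those errors are often temporary. The num_retries parameter and the retry policy are assumptions added for illustration; they are not part of the original notes.

```python
import urllib.error
import urllib.request


def download(url, num_retries=2):
    """Return the body of `url` as bytes, or None if the download fails.

    `num_retries` is a hypothetical parameter: how many more attempts
    to make after a 5xx (server-side) error.
    """
    print('Downloading:', url)
    try:
        with urllib.request.urlopen(url) as response:
            return response.read()
    except urllib.error.HTTPError as e:
        # HTTPError is a subclass of URLError, so it must be caught first.
        # A 5xx code means the server failed; a later attempt may succeed.
        print('Download error:', e.code, e.reason)
        if num_retries > 0 and 500 <= e.code < 600:
            return download(url, num_retries - 1)
    except urllib.error.URLError as e:
        # No HTTP response at all (DNS failure, refused connection, ...).
        print('Download error:', e.reason)
    return None
```

4xx errors are not retried because they indicate a problem with the request itself, so repeating it is unlikely to help. Note that in Python 3 the function returns bytes, which usually need to be decoded before text processing.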

4. Parsing robots.txt

import robotparser
rp = robotparser.RobotFileParser()
rp.set_url('http://www.youkuaiyun.com/robots.txt')
rp.read()
url = 'http://www.cdsn.net'
user_agent = 'BadCrawler'
rp.can_fetch(user_agent, url)
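
In Python 3 the robotparser module lives at urllib.robotparser. The sketch below shows how can_fetch responds to different user agents. To keep it runnable offline, it feeds the parser an inlined robots.txt through parse() instead of calling set_url() and read(); the rules, the example.com URL, and the GoodCrawler name are all made up for illustration.

```python
from urllib import robotparser

# Hypothetical robots.txt content, inlined so the example needs no network.
ROBOTS_TXT = """\
User-agent: BadCrawler
Disallow: /

User-agent: *
Disallow: /trap
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

url = 'http://example.com/index.html'  # placeholder URL
bad_allowed = rp.can_fetch('BadCrawler', url)    # BadCrawler is banned from the whole site
good_allowed = rp.can_fetch('GoodCrawler', url)  # other agents fall under the '*' rules
trap_allowed = rp.can_fetch('GoodCrawler', 'http://example.com/trap')

print(bad_allowed, good_allowed, trap_allowed)
```

A crawler would typically call can_fetch before each download and skip any URL for which it returns False.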

5. Parsing web pages

(1) Regular expressions:

https://docs.python.org/2/howto/regex.html
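
As a small illustration of the regular-expression approach, the sketch below pulls the text out of each list item in an HTML fragment. The fragment and the pattern are assumptions chosen for the example; patterns like this tend to break as soon as the page layout changes, which is one reason to prefer a real parser.

```python
import re

# Hypothetical, well-formed HTML fragment used only for this example.
html = '<ul class="country"><li>Area</li><li>Population</li></ul>'

# (.*?) is a non-greedy group: it stops at the first closing </li>,
# so each list item is captured separately.
items = re.findall(r'<li>(.*?)</li>', html)
print(items)
```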

(2) Beautiful Soup

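from bs4 import BeautifulSoup
# Malformed HTML: the attribute value is unquoted and the <li> tags are never closed.
broken_html = '<ul class=country><li>Area<li>Population</ul>'
soup = BeautifulSoup(broken_html, 'html.parser')
# prettify() returns the parsed tree as indented, well-formed markup.
fixed_html = soup.prettify()
print fixed_html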
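
Once the page is parsed, elements can be looked up with find() and find_all(). The following is a minimal Python 3 sketch; it uses a well-formed fragment (an assumption made for the example) so the result does not depend on how a particular parser repairs broken markup.

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed HTML fragment used only for this example.
html = '<ul class="country"><li>Area</li><li>Population</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

# Locate the list by its class attribute, then collect the text of every item.
ul = soup.find('ul', attrs={'class': 'country'})
items = [li.get_text() for li in ul.find_all('li')]
print(items)
```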