I. Setup: create a master package containing a spider.py file, and a slaver package containing spider.py and models.py
master/spider.py publishes tasks, i.e. it stores the URLs to be crawled in Redis; slaver/spider.py pulls URLs out of Redis in a distributed fashion, parses the content, and stores it in a MySQL database.
master\spider.py
slaver\spider.py
slaver\models.py
II. Crawl all article URLs and store them in Redis
1. Analyze the target site, and write the following code in master/spider.py:
I crawled CSDN blog posts matching the keyword python: enter python in the search box, click search, strip the unimportant parts out of the resulting URL, and the real pattern emerges: urls = ["https://so.youkuaiyun.com/so/search/s.do?p={page}&q={keyword}".format(page=i, keyword="python") for i in range(1, 6)]
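To make the pattern concrete, here is what that list comprehension actually generates (a throwaway check, not part of the spider itself):

urls = ["https://so.youkuaiyun.com/so/search/s.do?p={page}&q={keyword}".format(page=i, keyword="python")
        for i in range(1, 6)]
for url in urls:
    print(url)
# https://so.youkuaiyun.com/so/search/s.do?p=1&q=python
# ...
# https://so.youkuaiyun.com/so/search/s.do?p=5&q=python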
Here page is the page number and q is the keyword; this example only crawls pages 1 through 5. Next, design a function that extracts the href of every article on each of those pages:
def getinfo(urls):
    """
    :param urls: the search-result pages whose blog posts we want
    :return: a queue holding the URL of every article found
    """
    posts = queue.Queue()
    for url in urls:
        html = requests.get(url).content
        tree = etree.HTML(html)
        div = tree.xpath('//div[contains(@class,"search-list-con")]')
        if div:
            dl = div[0].xpath('./dl/dt/a[1]/@href')
            for singledl in dl:
                if singledl[8:11] != "edu":  # skip CSDN edu-platform ads
                    posts.put(singledl)
    return posts
html=requests.get(url).content fetches the page content
tree=etree.HTML(html) parses the fetched page into an element tree
div=tree.xpath('...') applies an XPath expression to the tree to pull out the div we need, returning a list
dl=div[0].xpath('...') collects the href attribute of every matching tag inside that div
if singledl[8:11] != "edu" filters out ads for CSDN's edu platform (see the slicing check below)
posts.put(singledl) pushes every non-ad URL onto the queue of valid URLs
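To see why the [8:11] slice works: "https://" takes up the first 8 characters of the URL, so characters 8 through 10 are the first three letters of the hostname. A quick check with made-up URLs (hypothetical, for illustration only):

# "https://" is 8 characters, so url[8:11] is the start of the hostname.
for url in ["https://edu.example.com/course/1",   # hypothetical ad link
            "https://blog.example.com/post/1"]:   # hypothetical article link
    print(url[8:11], url[8:11] != "edu")
# edu False  -> filtered out as an ad
# blo True   -> kept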
2. Create a thread class that stores the data in Redis (in practice, even a single thread would be fast enough):
class MasterWork(threading.Thread):
def __init__(self,q):
super(MasterWork, self).__init__()
self.q=q
def run(self):
while not self.q.empty():
url=self.q.get()
redi.lpush(REDIS_SPIDER_URLS_KEY,url)
q is the queue of URLs waiting to be written into Redis
while not self.q.empty() loops as long as the queue is not empty
url=self.q.get() pops a URL off the queue
redi.lpush(REDIS_SPIDER_URLS_KEY,url) REDIS_SPIDER_URLS_KEY is a Redis key of your choosing; I set it inside if __name__=='__main__': as REDIS_SPIDER_URLS_KEY = "csdn_post"
redi.lpush(REDIS_SPIDER_URLS_KEY,url) takes the key as its first argument and the value as its second (a quick way to inspect the resulting list is shown below)
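After the master has run, you can sanity-check what landed in Redis. A minimal sketch, assuming a local Redis on the default port and the same key as above:

import redis

redi = redis.Redis(host="127.0.0.1", port=6379)
print(redi.llen("csdn_post"))           # how many URLs were pushed
print(redi.lrange("csdn_post", 0, 4))   # peek at the first five entries (returned as bytes)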
3. Start multiple threads to store into Redis:
def startwork(q,max_thread=10):
thread_pool=[]
for i in range(max_thread):
thread=MasterWork(q)
thread_pool.append(thread)
for thread in thread_pool:
thread.start()
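Note that startwork returns as soon as the threads are started. If you want the master to block until every URL has been pushed, a variant that joins the workers might look like this (a sketch, not part of the original code; startwork_and_wait is a made-up name):

def startwork_and_wait(q, max_thread=10):
    thread_pool = [MasterWork(q) for _ in range(max_thread)]
    for thread in thread_pool:
        thread.start()
    for thread in thread_pool:
        thread.join()  # block until every worker has drained the queue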
The main block is set up as follows:
redi = redis.Redis(host="127.0.0.1", port=6379)
urls = ["https://so.youkuaiyun.com/so/search/s.do?p={page}&q={keyword}".format(page=i, keyword="python") for i in range(1, 6)]
REDIS_SPIDER_URLS_KEY = "csdn_post"
posts = getinfo(urls)
startwork(posts)
III. Crawl the content of each article and store it on the MySQL server; in the slaver package's models.py, do the following.
Required imports:
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
1. First, design the table structure
Use the sqlalchemy module to define the table:
Base = declarative_base()
class Posts(Base):
__tablename__='csdn_posts'
id=Column(Integer,primary_key=True,autoincrement=True)
title=Column(String(200),nullable=False)
content=Column(String(8000),nullable=False)
def __str__(self):
return self.title
__tablename__='csdn_posts' sets the table name to csdn_posts
id is the primary key, with auto-increment
title is the title, capped at 200 characters, non-null
content is the article body
def __str__ overrides the str magic method to return the post's title (a way to preview the generated DDL is sketched below)
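If you are curious what DDL this model maps to, SQLAlchemy can render it without touching the database (a sketch; the exact output depends on your SQLAlchemy version):

from sqlalchemy.dialects import mysql
from sqlalchemy.schema import CreateTable

# Compile the CREATE TABLE statement for the MySQL dialect.
print(CreateTable(Posts.__table__).compile(dialect=mysql.dialect()))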
engine = create_engine('mysql+mysqlconnector://zx:123456@localhost:3306/spider')
This line connects to the database: zx:123456 is a MySQL username and password with access rights, /spider is the database name, and the rest is the fixed form for the mysql-connector driver. A quick way to verify the connection follows.
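Before going further, it is worth confirming that the credentials actually work. A minimal smoke test, assuming the spider database already exists on the server:

from sqlalchemy import create_engine, text

engine = create_engine('mysql+mysqlconnector://zx:123456@localhost:3306/spider')
with engine.connect() as conn:
    # A trivial round-trip; this raises if the user, password, or DB name is wrong.
    print(conn.execute(text("SELECT 1")).scalar())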
2. Initialize the table
def initdb():
Posts.metadata.create_all(engine)
3. Return a session object, used to insert data into the table:
def get_session():
DBsession=sessionmaker(bind=engine)
session=DBsession()
return session
You can run initdb() from this models file once to create the table, then move on to slaver/spider.py. One design note on get_session before we do.
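get_session builds a new sessionmaker on every call, which works but is wasteful; the factory is usually created once at module level. A sketch of that variant, using the same engine:

from sqlalchemy.orm import sessionmaker

# Build the session factory once at import time ...
DBsession = sessionmaker(bind=engine)

def get_session():
    # ... and hand out a fresh session per call.
    return DBsession()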
IV. Write the data from Redis into the MySQL server
Required imports:
import threading
import requests
from lxml import etree
import redis
from slaver.models import get_session, Posts
get_session and Posts from slaver.models were written in the previous step; they are needed here to write data into the database.
1. Write the thread class:
class SlaverWork(threading.Thread):
    def __init__(self,redi):
super().__init__()
self.redi=redi
def run(self):
while True:
url = self.redi.lpop(REDIS_SPIDER_URLS_KEY)
if url is None:
break
html_content=requests.get(url).content
tree=etree.HTML(html_content)
title=tree.xpath('//h1/text()')[0]
content=tree.xpath('string(//*[@id="article_content"])')
content=content.strip()
session=get_session()
post=Posts(title=title,content=content)
try:
session.add(post)
session.commit()
session.close()
except Exception as e:
print(e)
session.rollback()
redi is the connection created later in the main block, redi=redis.Redis(host="127.0.0.1",port=6379), passed into the thread as a parameter. Inside run, url = self.redi.lpop(REDIS_SPIDER_URLS_KEY)
pops a URL out of Redis; because lpop behaves like an atomic queue pop, this essentially never fetches the same URL twice.
if url is None:
    break
terminates the thread once the Redis list runs dry (a more robust variant is sketched next).
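One caveat: if a slave starts before the master has pushed anything, lpop returns None immediately and the worker exits. A common remedy, not used in the original code, is the blocking pop blpop with a timeout; inside run, the lpop lines could be replaced with something like:

# Sketch: wait up to 10 seconds for a URL before giving up.
item = self.redi.blpop(REDIS_SPIDER_URLS_KEY, timeout=10)
if item is None:
    break                     # nothing arrived within the timeout
_, url = item                 # blpop returns a (key, value) tuple
url = url.decode("utf-8")     # redis-py returns bytes by default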
html_content=requests.get(url).content fetches the article page
tree=etree.HTML(html_content) builds the element tree
title=tree.xpath('//h1/text()')[0] extracts the main title
content=tree.xpath('string(//*[@id="article_content"])') extracts the article body as plain text
content=content.strip() trims leading and trailing whitespace, including \n and \t
session=get_session() obtains a database session via the imported helper
post=Posts(title=title,content=content) instantiates a post from the imported Posts class
try:
session.add(post)
session.commit()
session.close()
except Exception as e:
print(e)
session.rollback()
session.add(post) stages the post in the session (which plays a role similar to a cursor); commit writes it to the database, after which the session is closed; if an exception occurs it is printed and rollback() undoes the transaction. Note that close() here only runs on the success path; a safer arrangement follows.
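Because session.close() sits inside the try block, a failed commit rolls back but leaks the session. A slightly safer shape reorganizes the same calls with a finally block (a sketch; the complete slaver code below uses this shape):

try:
    session.add(post)
    session.commit()
except Exception as e:
    print(e)
    session.rollback()   # undo the failed transaction
finally:
    session.close()      # always release the session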
2. Start multiple threads to write the content:
def startwork(redi,max_thread=10):
thread_pool=[]
for i in range(max_thread):
thread=SlaverWork(redi)
thread_pool.append(thread)
for thread in thread_pool:
thread.start()
The main block contains the following:
if __name__=='__main__':
redi=redis.Redis(host="127.0.0.1",port=6379)
startwork(redi)
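Once the slaves finish, a quick query confirms that the articles landed in MySQL. A sketch reusing the helpers from models.py:

from slaver.models import get_session, Posts

session = get_session()
print(session.query(Posts).count())      # number of stored articles
for post in session.query(Posts).limit(3):
    print(post)                          # __str__ returns the title
session.close()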
The complete code follows.
master\spider.py:
import queue
import threading
import redis
import requests
from lxml import etree
def getinfo(urls):
    """
    :param urls: the search-result pages whose blog posts we want
    :return: a queue holding the URL of every article found
    """
    posts = queue.Queue()
    for url in urls:
        html = requests.get(url).content
        tree = etree.HTML(html)
        div = tree.xpath('//div[contains(@class,"search-list-con")]')
        if div:
            dl = div[0].xpath('./dl/dt/a[1]/@href')
            for singledl in dl:
                if singledl[8:11] != "edu":  # skip CSDN edu-platform ads
                    posts.put(singledl)
    return posts
class MasterWork(threading.Thread):
def __init__(self,q):
super(MasterWork, self).__init__()
self.q=q
def run(self):
while not self.q.empty():
url=self.q.get()
redi.lpush(REDIS_SPIDER_URLS_KEY,url)
def startwork(q,max_thread=10):
thread_pool=[]
for i in range(max_thread):
thread=MasterWork(q)
thread_pool.append(thread)
for thread in thread_pool:
thread.start()
if __name__=='__main__':
redi = redis.Redis(host="127.0.0.1", port=6379)
urls = ["https://so.youkuaiyun.com/so/search/s.do?p={page}&q={keyword}".format(page=i, keyword="python") for i in range(1, 6)]
REDIS_SPIDER_URLS_KEY = "csdn_post"
posts = getinfo(urls)
startwork(posts)
slaver\models.py:
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
Base = declarative_base()
engine = create_engine('mysql+mysqlconnector://zx:123456@localhost:3306/spider')
class Posts(Base):
__tablename__='csdn_posts'
id=Column(Integer,primary_key=True,autoincrement=True)
title=Column(String(200),nullable=False)
content=Column(String(8000),nullable=False)
def __str__(self):
return self.title
def initdb():
Posts.metadata.create_all(engine)
def get_session():
DBsession=sessionmaker(bind=engine)
session=DBsession()
return session
slaver\spider.py:
import threading
import requests
from lxml import etree
import redis
from slaver.models import get_session, Posts
REDIS_SPIDER_URLS_KEY = "csdn_post"
class SlaverWork(threading.Thread):
def __init__(self,redi):
super().__init__()
self.redi=redi
def run(self):
while True:
url = self.redi.lpop(REDIS_SPIDER_URLS_KEY)
if url is None:
break
html_content=requests.get(url).content
tree=etree.HTML(html_content)
title=tree.xpath('//h1/text()')[0]
content=tree.xpath('string(//*[@id="article_content"])')
content=content.strip()
session=get_session()
post=Posts(title=title,content=content)
            try:
                session.add(post)
                session.commit()
            except Exception as e:
                print(e)
                session.rollback()
            finally:
                session.close()
def startwork(redi,max_thread=10):
thread_pool=[]
for i in range(max_thread):
thread=SlaverWork(redi)
thread_pool.append(thread)
for thread in thread_pool:
thread.start()
if __name__=='__main__':
redi=redis.Redis(host="127.0.0.1",port=6379)
startwork(redi)