Scraping the Youdict dictionary with Python and building an index
Preparation:
1. Python basics
2. Networking fundamentals
3. How web crawlers work
4. Working knowledge of the requests module
5. Understanding the BeautifulSoup module
6. Database fundamentals
7. Working knowledge of pymysql
I won't rehash installing Python or pip-installing requests, pymysql, and beautifulsoup4 here; there are plenty of tutorials online (when in doubt, search first).
With those in place, let's start writing the crawler.
1. Define the target: target site http://www.youdict.com/ciku/

Target elements: the word (English and Chinese), the word link, and the image link

2. Write the code that fetches the page and extracts the elements:
newsurl = 'http://www.youdict.com/ciku/id_5_0_0_0_0.html'
res = requests.get(newsurl)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
divs = soup.select('.col-sm-6')  # one .col-sm-6 block per dictionary entry
for each_div in divs:
    english = each_div.div.div.h3.a.text        # English word
    imgurl = transurl(each_div.div.img['src'])  # image link, made absolute (see below)
    chinese = each_div.div.p.text               # Chinese definition
    insert(english, chinese, imgurl)            # write to the database (step 4)
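That tag chain (each_div.div.div.h3.a and friends) implies a page structure roughly like the sketch below. The markup here is a reconstruction from the selectors, not a capture of the live page, so treat it as an illustration only:

from bs4 import BeautifulSoup

# Reconstructed from the selector chain above; the live page's markup may
# differ. One .col-sm-6 block per dictionary entry.
sample = '''
<div class="col-sm-6">
  <div>
    <img src="/upload/apple.jpg">
    <div><h3><a href="/w/apple">apple</a></h3></div>
    <p>n. 苹果</p>
  </div>
</div>
'''
entry = BeautifulSoup(sample, 'html.parser').select('.col-sm-6')[0]
print(entry.div.div.h3.a.text)  # apple             (English word)
print(entry.div.img['src'])     # /upload/apple.jpg (relative image link)
print(entry.div.p.text)         # n. 苹果           (Chinese definition)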
3. Build each page's URL from the site's pagination pattern:
newsurl = 'http://www.youdict.com/ciku/id_5_0_0_0_' + str(i) + '.html'
where i is set by the loop counter.
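For example, the first few URLs the loop produces:

for i in range(3):
    print('http://www.youdict.com/ciku/id_5_0_0_0_' + str(i) + '.html')
# http://www.youdict.com/ciku/id_5_0_0_0_0.html
# http://www.youdict.com/ciku/id_5_0_0_0_1.html
# http://www.youdict.com/ciku/id_5_0_0_0_2.html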
4. Connect to the database and insert the data:
def insert(english, chinese, imgurl):
    db = pymysql.connect(host='localhost', user='root',
                         password='your db pass', database='your db name',
                         charset='utf8mb4')
    cursor = db.cursor()
    # parameterized query: the driver escapes the values, so no manual
    # escaping or string concatenation is needed
    sql = 'insert into reaserchwords(english, chinese, imgurl) values(%s, %s, %s)'
    cursor.execute(sql, (english, chinese, imgurl))
    db.commit()
    db.close()
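The schema of reaserchwords is never shown in this post, so here is a guess at a minimal table (including the index on english that the title promises) that would satisfy the insert above; the column sizes are assumptions:

import pymysql

# Assumed schema; the original table definition is not shown. The index on
# `english` is what makes word lookups on the search page fast.
ddl = '''
create table if not exists reaserchwords (
    id int auto_increment primary key,
    english varchar(64),
    chinese varchar(255),
    imgurl varchar(255),
    index idx_english (english)
) default charset = utf8mb4
'''
db = pymysql.connect(host='localhost', user='root',
                     password='your db pass', database='your db name')
db.cursor().execute(ddl)
db.commit()
db.close()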
5. Putting it all together, the complete crawler:
# coding=utf-8
'''
Created on 2018.8.18
@author: ZEC---
'''
import requests
import pymysql
from bs4 import BeautifulSoup

def insert(english, chinese, imgurl):
    db = pymysql.connect(host='localhost', user='root',
                         password='your db pass', database='your db name',
                         charset='utf8mb4')
    cursor = db.cursor()
    # parameterized query: the driver escapes the values safely
    sql = 'insert into reaserchwords(english, chinese, imgurl) values(%s, %s, %s)'
    cursor.execute(sql, (english, chinese, imgurl))
    db.commit()
    db.close()

def transurl(url):
    # prepend the site root so a relative image path becomes an absolute URL
    return ('http://www.youdict.com' + url).strip()

def main_thread(start, end):
    for i in range(start, end):
        newsurl = 'http://www.youdict.com/ciku/id_5_0_0_0_' + str(i) + '.html'
        res = requests.get(newsurl)
        res.encoding = 'utf-8'
        soup = BeautifulSoup(res.text, 'html.parser')
        divs = soup.select('.col-sm-6')  # one block per dictionary entry
        for each_div in divs:
            english = each_div.div.div.h3.a.text        # English word
            imgurl = transurl(each_div.div.img['src'])  # image URL
            chinese = each_div.div.p.text               # Chinese definition
            insert(english, chinese, imgurl)
        print('page ' + str(i + 1) + ' is ok')

main_thread(67, 274)
The word-search page I built on top of this data:
Live demo: www.senlear.com/words
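For completeness, a minimal sketch of the lookup such a search page might run against the scraped table; the function name and exact query are my own, not from the original post:

import pymysql

def search(word):
    # hypothetical lookup against the table built by the crawler
    db = pymysql.connect(host='localhost', user='root',
                         password='your db pass', database='your db name',
                         charset='utf8mb4')
    cursor = db.cursor()
    cursor.execute('select english, chinese, imgurl from reaserchwords '
                   'where english = %s', (word,))
    row = cursor.fetchone()
    db.close()
    return row  # e.g. ('apple', 'n. 苹果', 'http://www.youdict.com/upload/apple.jpg')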
