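Requirements:
Scrape the item links from 10 listing pages on 58同城 (58.com), together with the item details on each linked page, as shown in the figure below:
Problems encountered:
1. As shown in the figure, the category text has to be taken from the third breadcrumb link, with the surrounding whitespace stripped: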
categories = soup.select('span.crb_i > a')
list(categories[2].stripped_strings)
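2. Some items have an original price and some do not. In other words, a given tag is not always present, so its existence has to be checked first: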
list(primecosts[0].stripped_strings) if soup.find_all('b','price_ori') else None
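Both problems can be reproduced offline before touching the live site. The sketch below is only an illustration: the two HTML fragments are made up to match the selectors used in this post (the real 58.com markup may differ), and `extract` is a hypothetical helper name, not part of the crawler.

```python
from bs4 import BeautifulSoup

# Hypothetical fragments modeled on the selectors used in this post;
# the real page markup may differ.
WITH_ORIGINAL_PRICE = """
<span class="crb_i"><a>Beijing 58</a></span>
<span class="crb_i"><a>Second-hand</a></span>
<span class="crb_i"><a>
    Tablets
</a></span>
<span class="price_now"><i>1200</i></span>
<b class="price_ori"> 2500 </b>
"""

WITHOUT_ORIGINAL_PRICE = """
<span class="crb_i"><a>Beijing 58</a></span>
<span class="crb_i"><a>Second-hand</a></span>
<span class="crb_i"><a> Tablets </a></span>
<span class="price_now"><i>800</i></span>
"""


def extract(html):
    soup = BeautifulSoup(html, 'html.parser')
    categories = soup.select('span.crb_i > a')
    primecosts = soup.select('b.price_ori')
    return {
        # stripped_strings yields each text node with surrounding whitespace removed
        'category': list(categories[2].stripped_strings) if len(categories) > 2 else None,
        # select() returns an empty list when the tag is absent, so test it before indexing
        'original_price': list(primecosts[0].stripped_strings) if primecosts else None,
    }


print(extract(WITH_ORIGINAL_PRICE))     # {'category': ['Tablets'], 'original_price': ['2500']}
print(extract(WITHOUT_ORIGINAL_PRICE))  # {'category': ['Tablets'], 'original_price': None}
```

Testing the list returned by `select()` directly avoids a second search of the document, and the length check on `categories` guards against pages whose breadcrumb has fewer than three links.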
Crawler code:
from bs4 import BeautifulSoup
import requests
import time

# Listing pages 1 through 10 (range() excludes its stop value)
urls = ['http://bj.58.com/pbdn/0/pn{}/'.format(i) for i in range(1, 11)]


def getUrlList(url):
    # Collect the item links on one listing page and scrape each of them
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'html.parser')
    goodsurls = soup.select('tr.zzinfo > td.t > a')
    for goodsurl in goodsurls:
        data = {
            'goodsurl': goodsurl.get('href')
        }
        getInfo(data.get('goodsurl'))
        time.sleep(2)  # pause between requests


def getInfo(goodsurl):
    # Scrape the details of a single item page
    wb_data = requests.get(goodsurl)
    soup = BeautifulSoup(wb_data.text, 'html.parser')
    # Class names such as 'info_titile' and 'palce_li' are kept exactly as written
    categories = soup.select('span.crb_i > a')
    titles = soup.select('h1.info_titile')
    currentprices = soup.select('span.price_now > i')
    primecosts = soup.select('b.price_ori')
    regions = soup.select('div.palce_li > span > i')
    scannumbers = soup.select('span.look_time')
    for title, currentprice, region, scannumber in zip(titles, currentprices, regions, scannumbers):
        data = {
            # The category is the third breadcrumb link, if the page has one
            'category': list(categories[2].stripped_strings) if len(categories) > 2 else None,
            'title': title.get_text(),
            'current_price': currentprice.get_text(),
            # The original price is optional, so check before indexing
            'original_price': list(primecosts[0].stripped_strings) if primecosts else None,
            'region': region.get_text(),
            'views': scannumber.get_text()
        }
        print(data)


for url in urls:
    getUrlList(url)
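Because `range()` excludes its stop value, covering pages 1 to 10 needs a stop value of 11. A quick way to check the page list without sending any requests is sketched below; `page_urls` is a hypothetical helper, and the URL pattern is simply the one used in this post.

```python
# URL pattern copied from the crawler in this post
BASE = 'http://bj.58.com/pbdn/0/pn{}/'


def page_urls(first, last):
    # range() excludes its stop value, hence last + 1
    return [BASE.format(i) for i in range(first, last + 1)]


urls = page_urls(1, 10)
print(len(urls))          # 10
print(urls[0], urls[-1])  # first and last listing page
```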
Scraping results: