Scraping a machine tool show exhibitor list with a Python crawler

License: CC 4.0 BY-SA

Tags: #python #crawler

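A few months ago I needed to visit the machine tool show in Beijing, and before going I wanted a rough overview of the exhibiting companies and the kinds of products they were showing. The exhibitor list is here: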

http://www.cimtshow.com/ZHSExhibitorsListAction.do?actionType=showlist&topage=1&keyword=&language=zhs

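The list runs to 34 pages, which is quite a lot. If you only need what is already on the list pages, such as booth number and company name, importing it into Excel is easy. To filter the exhibitors, though, I also had to open each booth's exhibits link (labelled "参赛展品" on the page) and see which product categories it contained, and clicking through them one by one would have been far too tedious. I had read a little about basic web crawling before, and a crawler is a convenient way to solve this. The task is simple, but since this was both my first crawler and my first time using Python, it still took me a whole evening.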
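As a minimal sketch of the first step (enumerating the 34 list pages), the snippet below builds one URL per page. It assumes that the `topage` query parameter in the URL above selects the page and runs from 1 to 34; `BASE` and `list_page_urls` are illustrative names and are not taken from the original script, which follows afterwards.

```python
from urllib.parse import urlencode

# Endpoint taken from the exhibitor list URL above.
BASE = "http://www.cimtshow.com/ZHSExhibitorsListAction.do"

def list_page_urls(total_pages=34):
    """Return the URL of every list page, assuming `topage` is 1-based."""
    urls = []
    for page in range(1, total_pages + 1):
        # Same parameters, in the same order, as the URL above.
        query = urlencode({
            "actionType": "showlist",
            "topage": page,
            "keyword": "",
            "language": "zhs",
        })
        urls.append(BASE + "?" + query)
    return urls

if __name__ == "__main__":
    for url in list_page_urls():
        print(url)
```

Building the query with `urlencode` rather than string concatenation keeps the parameters readable and escapes a non-empty `keyword` correctly if one is ever needed.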

#!/usr/bin/env python3
import os
from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv

pages=[]	# list that stores the collected links
def getLinks(url):
	html=urlopen(url)
	bsObj=BeautifulSoup(html,"lxml")
	a=bsObj.find_all("table")[0]	# only look at links inside the first table on the page
	bls=a.find_all("a")
	for aa in bls:
		if 'href' in aa.attrs:
			# build the absolute URL first, so the duplicate check compares it with what is stored in pages
			newPage="http://www.cimtshow.com"+aa.attrs['href']
			if newPage not in pages:
				# we have found a new page
				print(newPage)
				pages.append(newPage)

for j in range(34):	# it took me a while to look up Python's loop syntax; this is how it is written
	url="http://www.cimtshow.co