My urge to collect is strong: the moment I see a book I want to download it, so with itchy hands I took a web crawler out for practice.
1. The source code is at the end of this post. It had barely downloaded two pages before it kept complaining about a dropped connection:
Exit with code 1 due to network error: HostNotFoundError
I'll wait until my USB 3.0 external network adapter arrives and download over a wired connection; the Wi-Fi at work is terrible. In the meantime, retrying failed pages is a possible stopgap (see the sketch after the code).
2. Each downloaded page is saved whole, without stripping out the surrounding clutter; this still needs improvement (a sketch that keeps only the main content block also follows the code).
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import requests
from bs4 import BeautifulSoup
import PyPDF2
import pdfkit
from PyPDF2 import PdfFileMerger  # imported but not used yet


def get_url_list():
    # Fetch the table of contents and collect the link and title of every section.
    response = requests.get("http://interactivepython.org/courselib/static/pythonds/index.html")
    soup = BeautifulSoup(response.content, 'html.parser')
    div_list = soup.find('div', attrs={'class': 'toctree-wrapper compound'})
    # print(div_list)
    a_s = div_list.find_all('a', attrs={'class': 'reference internal'})
    urls = []
    names = []
    for a in a_s:
        # url = a['href']
        url = "http://interactivepython.org/courselib/static/pythonds/" + a['href']
        name = a.get_text()
        urls.append(url)
        names.append(name)
    return urls, names


if __name__ == '__main__':
    urls, names = get_url_list()
    # On Windows, point pdfkit at the wkhtmltopdf binary if it is not on PATH:
    # path_wkthmltopdf = r'C:\wkhtmltopdf.exe'
    # config = pdfkit.configuration(wkhtmltopdf=path_wkthmltopdf)
    for ii in range(len(urls)):
        print(ii)
        pdfkit.from_url(urls[ii], names[ii] + '.pdf')
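
For the dropped-connection problem in point 1, one stopgap is to retry each page a few times instead of letting a single failure kill the whole run. This is only a sketch: save_with_retry, attempts and wait are names I made up here, and it assumes python-pdfkit, which raises an IOError/OSError when wkhtmltopdf exits with a non-zero code.

import time

import pdfkit


def save_with_retry(url, filename, attempts=3, wait=5):
    # Try pdfkit.from_url up to 'attempts' times, sleeping 'wait' seconds between tries.
    for i in range(attempts):
        try:
            pdfkit.from_url(url, filename)
            return True
        except OSError as exc:  # pdfkit wraps wkhtmltopdf failures in IOError/OSError
            print('attempt %d on %s failed: %s' % (i + 1, url, exc))
            time.sleep(wait)
    return False


# In the main loop the call would then become:
#     save_with_retry(urls[ii], names[ii] + '.pdf')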
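
For point 2, instead of handing the whole page URL to wkhtmltopdf, the page could be fetched with requests, cut down to its main text block with BeautifulSoup, and rendered with pdfkit.from_string. The selector below (a div with class 'section') is only a guess for this sketch; the real class or id has to be checked against the page source, and save_main_content is a name made up for the example.

import requests
import pdfkit
from bs4 import BeautifulSoup


def save_main_content(url, filename):
    # Fetch the page and render only its main text block to PDF.
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    main = soup.find('div', attrs={'class': 'section'})  # hypothetical selector; verify on the site
    if main is None:
        main = soup.body or soup  # fall back to the whole page if the guess misses
    html = u'<html><head><meta charset="utf-8"></head><body>%s</body></html>' % main
    pdfkit.from_string(html, filename)

One caveat with this approach: relative image and stylesheet links no longer resolve once the HTML is rendered from a string rather than from the original URL, so images may be missing unless the links are rewritten to absolute URLs first.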